Berry--Esseen Bounds for Multivariate Nonlinear Statistics with Applications to M-estimators and Stochastic Gradient Descent Algorithms
aa r X i v : . [ m a t h . P R ] F e b Berry–Esseen Bounds for MultivariateNonlinear Statistics with Applications toM-estimators and Stochastic GradientDescent Algorithms
Qi-Man Shao ∗1,2 and Zhuo-Song Zhang †2,3 Department of Statistics and Data Scinece, Southern University of Science andTechnology, Shenzhen, Guangdong, P.R. China. e-mail: [email protected] Department of Statistics, The Chinese University of Hong Kong, Shatin, N.T. Hong Kong Department of Statistics and Applied Probability, National University of Singapore,Singapore 117546. e-mail: [email protected]
Abstract:
We establish a Berry–Esseen bound for general multivari-ate nonlinear statistics by developing a new multivariate-type randomizedconcentration inequality. The bound is the best possible for many knownstatistics. As applications, Berry–Esseen bounds for M-estimators and av-eraged stochastic gradient descent algorithms are obtained.
MSC2020 subject classifications:
Primary 60F05, 62E20; secondary 62F12.
Keywords and phrases:
Berry–Esseen bound, Multivariate normal ap-proximation, Randomized concentration inequality, Stein’s method, M-estimators,Averaged stochastic gradient descent algorithms.
1. Introduction
Let X , . . . , X n be independent random variables taking values on X and let T := T ( X , . . . , X n ) be a general d -dimensional nonlinear statistic. In manycases the nonlinear statistic can be written as a linear statistic plus an errorterm: T = W + D, (1.1)where W = n X i =1 ξ i , D := D ( X , . . . , X n ) = T − W, (1.2) ξ i := h i ( X i ) ∈ R d and h i : X 7→ R d is a Borel measurable function. Assume that E ξ i = 0 for each i n and n X i =1 E { ξ i ξ ⊺ i } = I d . (1.3) ∗ Research partially supported by NSFC12031005 and Shenzhen Outstanding Talents Train-ing Fund and also by Hong Kong RGC GRF 14302515 and 14304917. † Corresponding author. Research partially supported by Singapore Ministry of EducationAcademic Research Fund MOE 2018-T2-076. 1 .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics Let γ := γ n = n X i =1 E k ξ i k . (1.4)Since ξ i is standardized, we remark that h i = h n,i and ξ i = ξ n,i . If k D k p → and γ → as n → ∞ , then, clearly, T converges in distribution to a d -dimensionalstandard normal distribution N (0 , I d ) .The aim of this paper is to provide a Berry–Esseen bound of the multivariatenormal approximation for the nonlinear statistic T . The Berry–Esseen bound formultivariate normal approximation has been well studied in the past decades.For the linear statistic W , Bentkus [4, 5] used induction and Taylor’s expansionto prove a Berry–Esseen bound of order d / γ , which is the best known result forthe dependence on the dimension d . We refer to Nagaev [16], Senatov [28], Götze[14], Bhattacharya and Holmes [7] and Raič [25] for other results for independentrandom vectors.In the case where d = 1 , Chen and Shao [9] proved a Berry–Esseen bound for T using the Berry–Esseen bound for W and a randomized-type concentrationinequality approach: sup z ∈ R | P ( T z ) − Φ( z ) | . γ + E | W D | + n X i =1 E | ξ i ( D − D ( i ) ) | , (1.5)where D ( i ) is any random variable such that ξ i is independent of D ( i ) and Φ is the standard normal distribution function. For the Berry–Esseen bound formultivariate normal approximation, Chen and Fang [10] proved a concentrationinequality for d -dimensional exchangeable pairs. We also refer to Barbour [3],Götze [14], Goldstein and Rinott [13], Chatterjee and Meckes [8], Reinert andRöllin [26], Bhattacharya and Holmes [7], Chen et al. [11], Chen and Fang [10]and Raič [25] for the development of Stein’s method for multivariate normalapproximations.The main purpose of this paper is to prove a Berry–Esseen bound for nonlin-ear multivariate statistics by developing a new randomized multivariate concen-tration inequality which generalizes the results of Chen and Shao [9] and Chenand Fang [10]. Our main result can be applied to a large class of non-linearstatistics, including M-estimators and averaged stochastic gradient descent es-timators.Throughout this paper, we use the following notations. Let d > and x =( x , . . . , x d ) be a vector in R d . For x, y ∈ R d , denote by h x, y i the inner productof x and y . Let k x k = p h x, x i be the l -norm of x . For a d × d matrix A ,and let λ min ( A ) and λ max ( A ) be the minimal and maximal eigenvalue of A ,respectively. Denote by A ⊺ the transpose of A and by k A k the spectral norm,i.e., k A k := ( λ max ( A ⊺ A )) / . Let I d be the d -dimensional identity matrix. For X ∈ R (resp. R d ) and p > , let k X k p = ( E {| X | p } ) /p (resp. ( E {k X k p } ) /p ) bethe L p -norm of X . .-M. Shao and Z.-S. 
Zhang/Berry–Esseen bounds for multivariate nonlinear statistics The rest of this paper is organized as follows. In Section 2, we present theBerry–Esseen bound of the multivariate normal approximation for T . In Section 3,we apply our main result to M-estimators and averaged stochastic gradient de-scent algorithms. In Section 4, we present a randomized concentration inequalityfor multivariate linear statistics and give the proof of the main result. The proofsof theorems in Section 3 are postponed to Section 5.
2. Main results
Let ( X , . . . , X n ) , ( ξ , . . . , ξ n ) , W, T and D be defined as in (1.1) and (1.2). Let A be the collection of all convex sets in R d . Let Z ∼ N (0 , I d ) . The followingtheorem provides a Berry–Esseen bound for T . Theorem 2.1.
Assume that (1.3) is satisfied. Then, sup A ∈A (cid:12)(cid:12) P ( T ∈ A ) − P ( Z ∈ A ) (cid:12)(cid:12) d / γ + 2 E (cid:8) k W k ∆ (cid:9) + 2 n X i =1 E (cid:8) k ξ i k| ∆ − ∆ ( i ) | (cid:9) , (2.1) for any random variables ∆ and (∆ ( i ) ) i n such that ∆ > k D k and ∆ ( i ) isindependent of X i , where γ is as defined in (1.4) . Remark 2.1.
The choices of ∆ and ∆ ( i ) are flexible. For example, let ( X ′ , . . . , X ′ n ) be an independent copy of ( X , . . . , X n ) , one may choose ∆ = k D k and ∆ ( i ) = k D ( i ) k , where D ( i ) = D ( X , . . . , X i − , X ′ i , X i +1 , . . . , X n ) . One can also choose D ( i ) = D ( X , . . . , X i − , , X i +1 , . . . , X n ) . Moreover, the last term in (2.1) can-not be removed, and we refer to Chen and Shao [9, Section 4] for a counterex-ample. Remark 2.2.
For d = 1 , the right hand side of (2.1) reduces to γ + 2 E {| W | ∆ } + 2 n X i =1 E (cid:12)(cid:12) ξ i (∆ − ∆ ( i ) ) (cid:12)(cid:12) , which differs from (1.5) up to a constant factor.The Berry–Esseen bound (2.1) provides an optimal order in terms of n formany applications. However, the order in d may not be optimal in (2.1). For alinear statistic W , Bentkus [5] proved that sup A ∈A (cid:12)(cid:12) P ( W ∈ A ) − P ( Z ∈ A ) (cid:12)(cid:12) Cd / γ, where C > is an absolute constant and d / is believed to be the best possible.Here, C > is an absolute constant, and Raič [25] recently obtained a boundwith an explicit constant d / + 16 by using Stein’s method. However, it isnot clear how to obtain the order d / in our result.Using the technique of truncation, we obtain the following corollary, whichmay be useful for applications. .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics Corollary 2.2.
Let O be a measurable set and ∆ be a random variable suchthat ∆ > k D k ( O ) . Under the conditions of Theorem , we have sup A ∈A (cid:12)(cid:12) P ( T ∈ A ) − P ( Z ∈ A ) (cid:12)(cid:12) d / γ + 2 E (cid:8) k W k ∆ (cid:9) + 2 n X i =1 E (cid:8) k ξ i k| ∆ − ∆ ( i ) | (cid:9) + P ( O c ) , where ∆ ( i ) is any measurable random variable that is independent of X i . Condition (1.3) can be extended to a general case. We have the followingcorollary.
Corollary 2.3.
Let
T, W, D and ( ξ , . . . , ξ n ) be defined as in (1.1) and (1.2) .Assume that ( ξ , . . . , ξ n ) satisfies: E { ξ i } = 0 for i n and n X i =1 E { ξ i ξ ⊺ i } = Σ , where Σ is a positive definite matrix with λ min (Σ) > σ > . Then sup A ∈A (cid:12)(cid:12) P ( T ∈ A ) − P (Σ / Z ∈ A ) (cid:12)(cid:12) σ − / d / γ + 2 σ − E (cid:8) k W k ∆ (cid:9) + 2 σ − n X i =1 E (cid:8) k ξ i k| ∆ − ∆ ( i ) | (cid:9) , for any random variables ∆ and (∆ ( i ) ) i n such that ∆ > k D k and ∆ ( i ) isindependent of X i , where γ is as defined in (1.4) .
3. Applications
In this section, we apply Theorem 2.1 to M-estimators and stochastic gradientdescent algorithms.
Let
X, X , . . . , X n be i.i.d. random variables with common probability distri-bution P that take values in a measurable space ( X , B ( X )) . For any function f : X 7→ R , let P n f = 1 n n X i =1 f ( X i ) , P f = Z X f ( x ) P ( dx ) , G n f = √ n ( P n − P ) f. (3.1)Let Θ ⊂ R d be a parameter space. For each θ ∈ Θ , let m θ ( · ) : X 7→ R be twicedifferentiable with respect to θ , and write M n ( θ ) = P n m θ , M ( θ ) = P m θ . (3.2) .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics Following the notations in Van der Vaart [31], we briefly write ˙ m θ ( x ) = ∇ θ m θ ( x ) , ¨ m θ ( x ) = ∇ θ m θ ( x ) , (3.3)where ∇ θ m θ ( x ) is the gradient with respect to θ . Let θ ∗ = arg min θ ∈ Θ M ( θ ) (3.4)and we say ˆ θ n is an M-estimator of θ ∗ if ˆ θ n = arg min θ ∈ Θ M n ( θ ) . (3.5)For any p > and Y ∈ R d , let k Y k p = ( E {k Y k p } ) /p be the L p -norm of Y .The asymptotic properties for M-estimators have been well studied in theliterature, and we refer to Van der Vaart and Wellner [32], Van der Vaart [31]and the references therein for a thorough reference. Under some regularity con-ditions, one has ˆ θ n p → θ ∗ , and Pollard [22] showed that √ n (ˆ θ n − θ ∗ ) convergesweakly to a d -dimensional normal distribution. The convergence rate was alsostudied by many authors, for instance, Pfanzagl [19, 20, 21] proved a Berry–Esseen bound of order O ( n − / ) for the minimum contrast estimates undersome regularity conditions.In this subsection, we provide a Berry–Esseen bound for √ n (ˆ θ n − θ ∗ ) undersome convexity conditions, which are different from those in Pfanzagl [20]. Forsymmetric matrices A and B , denote by A ( resp. < ) B if A − B is non-positive(resp. non-negative) definite. We first propose the following two assumptions.(M1) The function m θ ( · ) is twice differentiable with respect to θ and there existconstants µ > , c > , c > and two nonnegative functions m , m : X 7→ R with k m ( X ) k c and k m ( X ) k c , such that for any θ ∈ Θ , M ( θ ) − M ( θ ∗ ) > µ k θ − θ ∗ k , (3.6) | m θ ( x ) − m θ ∗ ( x ) | m ( x ) k θ − θ ∗ k , ∀ x ∈ X , (3.7)and k ¨ m θ ( x ) − ¨ m θ ∗ ( x ) k m ( x ) k θ − θ ∗ k , ∀ x ∈ X . (3.8)Moreover, there exists a constant c > and a nonnegative function m : X 7→ R such that for any x ∈ X , ¨ m θ ∗ ( x ) m ( x ) I d and k m ( X ) k c . (3.9)(M2) Let ξ i = ˙ m θ ∗ ( X i ) := ( ξ i, , . . . , ξ i,d ) ⊺ , Σ = E { ξ i ξ ⊺ i } and V = E { ¨ m θ ∗ ( X ) } .Assume that there exist constants λ > and λ > such that λ min (Σ) > λ and λ min ( V ) > λ . Moreover, assume that there exists a constant c > such that k ξ k c d / . (3.10) .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics The following theorem provides a Berry–Esseen bound for the M-estimators.
Theorem 3.1.
Let θ ∗ and ˆ θ n be defined as in (3.4) and (3.5) . Under the con-ditions (M1) and (M2) , we have sup A ∈A (cid:12)(cid:12)(cid:12) P (cid:0) n / Σ − / V (ˆ θ n − θ ∗ ) ∈ A (cid:1) − P ( Z ∈ A ) (cid:12)(cid:12)(cid:12) Cd / n − / , where C > is a constant depending only on c , c , c , c , µ, λ and λ . Remark 3.1.
The assumptions (M1) and (M2) are neater than those in Pfan-zagl [20]. Moreover, Theorem 3.1 provides a Berry–Esseen bound with the de-pendence on the dimension.
Remark 3.2.
Based on the proof of Theorem 3.1, if we further assume that | m ( X i ) | c for each i n almost surely, then the assumption for m ( x ) can be replaced by k m ( X ) k c . The condition (3.10) is satisfied if k ξ ij k c for all i n and j d . Remark 3.3.
The twice differentiability of m θ ( x ) holds for many applications.However, in general, ¨ m θ ( x ) does not necessarily exist. We will discuss this casein the next subsection.When m θ ( · ) is smooth in θ , one can compute ˆ θ n by solving the score equation P n ˙ m θ = 1 n n X i =1 ˙ m θ ( X i ) = 0 . More generally, we can consider the estimating equations of the following type.Let Θ ⊂ R d be the parameter space and for each θ ∈ Θ , let h θ : X 7→ R d , andlet Ψ n ( θ ) = 1 n n X i =1 h θ ( X i ) , Ψ( θ ) = E { h θ ( X ) } . Let ˆ θ n and θ ∗ satisfy Ψ n (ˆ θ n ) = 0 , Ψ( θ ∗ ) = 0 . (3.11)The estimator ˆ θ n in (3.11) is often called a Z-estimator of θ ∗ , see e.g., Van derVaart [31]. However, although there is no maximization in (3.11), the estimator ˆ θ n is also called an M-estimator of θ ∗ . Assume that Ψ( θ ) is differentiable at θ ∗ and there exists a d × d matrix ˙Ψ satisfying Ψ( θ ) − Ψ( θ ∗ ) − ˙Ψ ( θ − θ ∗ ) = o( k θ − θ ∗ k ) as θ → θ ∗ .Under some regularity conditions and the so called “asymptotic equi-continuity”condition, Huber [15] proved that √ n (ˆ θ n − θ ∗ ) converges in distribution to ˙Ψ − Z ,where Z ∼ N (0 , E { h θ ∗ ( X i ) h θ ∗ ( X i ) ⊺ } ) . Bentkus, Bloznelis and Götze [6] proveda Berry–Esseen bound of order O( n − / ) for the -dimensional case under someconvexity conditions, and Paulauskas [18] proved a convergence rate result for .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics the d -dimensional case under some smooth stochastic differentiability conditions,which are different from the conditions (B1)–(B5) below.Let p > be a fixed number, and we make the following assumptions.(B1) There exist positive constants µ, c and λ and a positive definite matrix ˙Ψ such that (cid:10) Ψ( θ ) − Ψ( θ ) , θ − θ (cid:11) > µ k θ − θ k , (3.12)and k Ψ( θ ) − Ψ( θ ∗ ) − ˙Ψ ( θ − θ ∗ ) k c k θ − θ ∗ k , λ min ( ˙Ψ ) > λ . (3.13)(B2) Let h θ,j be the j -th element of h θ . There exists a function h : X 7→ R + and a constant c > such that for any θ, θ ′ ∈ Θ , (cid:12)(cid:12) h θ,j ( X ) − h θ ′ ,j ( X ) (cid:12)(cid:12) h ( X ) k θ − θ ′ k . (3.14)and k h ( X ) k p c . (3.15)(B3) Let ξ i = h θ ∗ ( X i ) and Σ = E { ξ i ξ ⊺ i } . Assume that there exist positiveconstants c and λ such that λ min (Σ) > λ , (3.16)and k ξ k p c d / . (3.17) Remark 3.4.
Following notations in Theorem 3.1, we can choose h θ ( x ) =˙ m θ ( x ) . Note that the assumption (B1) is weaker than (M1) in the sense of thedifferentiability of h θ , because we assume that the differentiability only holdsfor Ψ( θ ) rather than h θ ( x ) . Theorem 3.2.
Let ˆ θ n and θ ∗ be defined as in (3.11) . Let p > and D Θ :=sup θ ,θ ∈ Θ k θ − θ k , the diameter of the parameter space Θ . Assume that con-ditions (B1)–(B3) are satisfied. Then, sup A ∈A (cid:12)(cid:12) P (cid:0) √ n Σ − / ˙Ψ (ˆ θ n − θ ∗ ) ∈ A (cid:1) − P (cid:0) Z ∈ A (cid:1)(cid:12)(cid:12) C ( D Θ + 1) d / n − / ε p . (3.18) where ε p = 1 / (2 p − and C > is a constant depending on p, c , c , c , λ , λ and µ . Remark 3.5.
Under some different conditions and assuming that E k ξ i k isbounded, Paulauskas [18, Theorem 9] proved a bound of order n − / (log n ) / .In Theorem 3.2 with p = 3 , the result (3.18) reduces to D d / n − / , whichis of a sharper order than Paulauskas [18]. Moreover, Theorem 3.2 provides aresult with the dependence on the dimension d . .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics The order n − / ε p can be improved to n − / log n under some stronger con-ditions. Let us introduce the so-called Orlicz norm , one may refer to Van derVaart and Wellner [32, Section 2.2] for more details. Let ψ : [0 , ∞ ) [0 , ∞ ) bea nondecreasing, convex function with ψ (0) = 0 . Let Y be a R d -valued randomvariable, and define the Orlicz norm of Y with respect to ψ to be k Y k ψ = inf n C > E n ψ (cid:16) k Y k C (cid:17)o o . (3.19)Specially, if we choose ψ ( x ) = x p for p > , then the corresponding Orlicznorm is simply the L p -norm. Let ψ ( x ) := e x − . Now we propose the followingassumptions.(B4) The condition (3.15) in (B2) is replaced by k h ( X ) k ψ c . (3.20)where c > is a constant.(B5) The condition (3.17) in (B3) is replaced by k ξ k ψ c , (3.21)where c > is a constant. Remark 3.6.
Let Y be a random variable. It can be shown (see Vershynin[33, (5.14)–(5.16)] for example) that, there exist positive constants K , K , K that differ from each other by at most an absolute constant factor such that thefollowing are equivalent:(a) k Y k ψ K ;(b) P ( | Y | > t ) exp { − t/K } for all t > ;(c) k Y k p K p for all p > .We have the following theorem. Theorem 3.3.
Let ˆ θ n , θ ∗ and D Θ be defined as in Theorem . Under theassumptions (B1) , (B4) and (B5) , sup A ∈A (cid:12)(cid:12) P (cid:0) √ n Σ − / ˙Ψ (ˆ θ n − θ ∗ ) ∈ A (cid:1) − P (cid:0) Z ∈ A (cid:1)(cid:12)(cid:12) C ( D Θ + 1) d n − / log n, where C > is a constant depending on c , c , c , λ , λ and µ . Consider the problem of searching for the minimum point θ ∗ of a smooth func-tion f ( θ ) , θ ∈ Θ ⊂ R d . The stochastic gradient descent method provides a directway to solve the minimization problem. In this subsection, we consider the av-eraged stochastic gradient descent algorithm, which is proposed by Polyak [23] .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics and Ruppert [27]. The algorithm is given as follows: Let θ ∈ R d be the initialvalue (might be random), and for n > , we update θ n by θ n = θ n − − ℓ n (cid:0) ∇ f ( θ n − ) + ζ n (cid:1) , ¯ θ n = 1 n n − X i =0 θ i . (3.22)where ℓ n > is the so called learning rate and ( ζ , ζ , . . . ) is a sequence of R d -valued martingale differences. The convergence rate of E k θ n − θ ∗ k and E k ¯ θ n − θ ∗ k was thoroughly studied in the literature, see Polyak [23] andBach and Moulines [2]. The normality of √ n (¯ θ n − θ ∗ ) is also well-known, seePolyak and Juditsky [24]. Suppose that the learning rate ℓ n = ℓ n − α where α ∈ (1 / , , under some regularity conditions, Polyak and Juditsky [24] provedthat √ n (¯ θ n − θ ∗ ) converges weakly to a multivariate normal distribution. Re-cently, Anastasiou, Balasubramanian and Erdogdu [1] used Stein’s method andthe techniques of martingales to prove a convergence rate for a class of smoothtest functions, see Anastasiou, Balasubramanian and Erdogdu [1, Theorem 4]for more details.In this subsection, we provide a Berry–Esseen bound for the normal approx-imation for √ n (¯ θ n − θ ∗ ) .We make the following assumptions:(C0) There exists a constant τ > such that k θ − θ ∗ k τ . (C1) The sequence ( ζ , ζ , . . . ) is independent of θ , and for each n > , ζ n admits the decomposition ζ n = ξ n + η n , where(i). ( ξ , ξ , . . . ) is a sequence of independent random variables and E { ξ i } =0 and E (cid:8) ξ i ξ ⊺ i (cid:9) = Σ i ; there exist positive numbers λ and λ such thatfor any i > , λ λ min (Σ i ) λ max (Σ i ) λ ; moreover, there existsa positive number τ such that max i n k ξ i k τ ; (ii). let F = σ { θ } , and for each n > , F n = σ { θ , ξ , . . . , ξ n } ; let g ( · , · ) : R d × R d R d , and the random variable η n := g ( θ n − , ξ n ) satisfies E { η n |F n − } = 0 and for any θ and θ ′ , there exists a nonnegativenumber c > such that k g ( θ, ξ ) − g ( θ ′ , ξ ) k c k θ − θ ′ k and g ( θ ∗ , ξ ) = 0 for ξ ∈ R d . (3.23)(C2) The function f is L -smooth and strongly convex with convexity constant µ > , i.e., f is twice differentiable and there exist two constants µ > and L > such that µI d ∇ f ( θ ) LI d , for all θ ∈ Θ . (3.24) .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics (C3) There exist positive constants c and β such that for all θ with k θ − θ ∗ k β , (cid:13)(cid:13) ∇ f ( θ ) − ∇ f ( θ ∗ ) (cid:13)(cid:13) c k θ − θ ∗ k . (3.25)Let G := ∇ f ( θ ∗ ) . Recall that ( ℓ n ) n > is the learning rate sequence in (3.22),and let Q i = ℓ i n − Y j = i j Y k = i +1 ( I d − ℓ k G ) . Here, for any n > , set Q ni = n +1 A i = I d , Q ni = n +1 a i = 1 , where ( A i ) i > is a R d × d -valued sequence and ( a i ) i > is a R -valued sequence. Let Σ n = 1 n n − X i =1 Q i Σ i Q ⊺ i . We have the following theorem.
Theorem 3.4.
Let ℓ n = ℓ n − α where ℓ > and / < α . Under theassumptions (C0)–(C3) , we have (1) if α ∈ (1 / , , sup A ∈A (cid:12)(cid:12)(cid:12) P (cid:0) √ n Σ − / n (¯ θ n − θ ∗ ) ∈ A (cid:1) − P ( Z ∈ A ) (cid:12)(cid:12)(cid:12) C (cid:0) d / + τ + τ (cid:1) ( d / n − / + n − α +1 / ); (3.26)(2) if ℓ n = ℓ n − with ℓ µ > , we have sup A ∈A (cid:12)(cid:12)(cid:12) P (cid:0) √ n Σ − / n (¯ θ n − θ ∗ ) ∈ A (cid:1) − P ( Z ∈ A ) (cid:12)(cid:12)(cid:12) Cn − / ( d / + τ + τ ) × ( d / + log n, ℓ µ > d / (log n ) , ℓ µ = 1 . (3.27) Here,
C > is a constant depending only on ℓ , λ , λ , c , c , α, β, L and µ andindependent of d, τ and τ . Remark 3.7.
Typically, τ ∼ τ ∼ d / . Specially, if α = 1 − ε with an arbitrary < ε < / , then the RHS of (3.26) reduces to C ( d n − / + d / n − / ε ) . If α = 1 with ℓ µ > , the Berry–Esseen bound (3.27) is of an optimal order upto a polynomial of a (log n ) factor. Remark 3.8.
For α = 1 , it has been proved (see Bach and Moulines [2, Theorem2] and also Lemma 5.12) that E k θ n − θ ∗ k n − , ℓ µ > n − (log n ) , ℓ µ = 1; n − ℓ µ/ , < ℓ µ < . .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics Therefore, for α = 1 , the choice of ℓ is critical, but the problem is: a small ℓ leads to a very slow convergence rate of order n − ℓ µ/ while a large ℓ mightlead to explosion due to the initial condition (see, e.g., Bach and Moulines [2]and Nemirovski, Juditsky, Lan and Shapiro [17] for more details). In practice,one prefers to use a learning rate of order n − α with < α < . Theorem 3.5.
Consider the model (3.2) . Let θ ∗ = min θ ∈ R d M ( θ ) , and the algorithm θ n = θ n − − ℓ n ˙ m θ n − ( X n ) , where ˙ m θ is as in (3.3) , ℓ n = ℓ n − α is the learning rate, ℓ > , / < α and θ is the initial value that is independent of ( X , . . . , X n ) . Let ξ n = ˙ m θ ∗ ( X n ) − ∇ M ( θ ∗ ) ,η n = ˙ m θ n − ( X n ) − ˙ m θ ∗ ( X n ) − ∇ M ( θ n − ) + ∇ M ( θ ∗ ) . Assume that (C1(i)) is satisfied for ( ξ , . . . , ξ n ) and for any θ , θ ∈ R d , sup z ∈X k ˙ m θ ( z ) − ˙ m θ ( z ) k L F k θ − θ k . (3.28) Assume further that (C0) , (C2) and (C3) are satisfied with f ( θ ) = M ( θ ) , and let ¯ θ n be as defined in (3.22) . Then, we have (3.26) and (3.27) hold with c = 2 L F .Proof. We only need to check the condition (C1(ii)) is satisfied. Note that foreach n > , ˙ m θ n − ( X n ) = ∇ M ( θ n − ) + (cid:0) ˙ m θ n − ( X n ) − ∇ M ( θ n − ) (cid:1) = ∇ M ( θ n − ) + (cid:0) ˙ m θ ∗ ( X n ) − ∇ M ( θ ∗ ) (cid:1) + (cid:0) ˙ m θ n − ( X n ) − ˙ m θ ∗ ( X n ) − ∇ M ( θ n − ) + ∇ M ( θ ∗ ) (cid:1) = ∇ M ( θ n − ) + ξ n + η n . For n > , let F n = σ ( θ , X , . . . , X n ) . Then we have E { ξ n } = 0 and E (cid:8) η n (cid:12)(cid:12) F n − (cid:9) = 0 . By (3.28), it follows that the condition (C1(ii)) in (3.6) holds with c = 2 L F . Hence, Theorem 3.4 implies the desired result.
4. Proofs of main results
To prove (2.1), we need to develop a randomized concentration inequality forsums of multivariate independent random vectors. We use the following notation.For a subset A of R d , let d ( x, A ) = inf {k x − y k : y ∈ A } . For a given number ε > , define A ε = (cid:8) x ∈ R d : d ( x, A ) ε (cid:9) , and A − ε = { x ∈ A : B ( x, ε ) ⊂ A } , .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics where B ( x, ε ) is the d -dimensional ball centered in x with radius ε . Specially, for ε = 0 , let A ε = A . Let ¯ A be the closure of A and let r ( ¯ A ) = max { y : B ( x, y ) ⊂ ¯ A for some x ∈ R n } be the inradius of ¯ A . For a, b ∈ R , write a ∧ b = min( a, b ) and a ∨ b = max( a, b ) . Let γ = P ni =1 E {k ξ i k } be as in (1.4). We have thefollowing proposition. Proposition 4.1.
Let W = P ni =1 ξ i , where ( ξ i ) ni =1 is a sequence of R d -valuedindependent random vectors satisfying that E { ξ i } = 0 for i n and P ni =1 E { ξ i ξ ⊺ i } = I d . Let ∆ and ∆ be nonnegative random variables. Thenwe have for all A ∈ A such that r ( ¯ A ) > γ , P (cid:0) W ∈ A γ +∆ \ A γ − ¯∆ (cid:1) d / γ + 2 E (cid:8) k W k (∆ + ∆ ) (cid:9) + 2 n X i =1 2 X j =1 E (cid:8) k ξ i k| ∆ j − ∆ ( i ) j | (cid:9) , (4.1) where ¯∆ = ∆ ∧ ( r ( ¯ A ) − γ ) and ∆ ( i ) is a random variable independent of ξ i . The proof of this proposition is postponed in Subsection 4.3.
Remark 4.1.
Specially, if ∆ = ε and ∆ = 0 where ε > is a constant, then(4.1) reduces to P ( W ∈ A γ + ε \ A γ ) d / ε + 19 d / γ, which is equivalent to the result in Chen and Fang [10] up to a constant factor. Remark 4.2.
When d = 1 , the right hand side of (4.1) reduces to γ + 2 E | W (∆ + ∆ ) | + 2 n X i =1 2 X j =1 E | ξ i (∆ j − ∆ ( i ) j ) | , which is equivalent to Chen and Shao [9]’s concentration inequality result. Re-cently, Shao and Zhou [29] proved that the term E | W ∆ | can be improved tobe E | ∆ | in (1.5). However, due to some technical difficulty, we are not able toremove the W term in our result. Nevertheless, the order in n is optimal inmany applications. and Corollaries and We first give the proof of Theorem 2.1.
Proof of Theorem . Without loss of generality, let A be an arbitrary nonemptyconvex subset of R d . Let Z ∼ N (0 , I d ) be independent of all others. It hasbeen shown in Chen and Fang [10, Proposition 2.5 and Theorem 3.5] that for ε , ε > , sup A ∈A (cid:12)(cid:12) P ( W ∈ A ) − P ( Z ∈ A ) (cid:12)(cid:12) d / γ, (4.2) P ( Z ∈ A ε \ A − ε ) d / ( ε + ε ) . (4.3) .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics For each i n , let ∆ ( i ) be any random variable that is independent of ξ i .Note that k T − W k ∆ and that r ( ¯ A γ ) > γ . Applying Proposition 4.1 to A γ with ∆ = ∆ and ∆ = 0 , and by (4.2) and (4.3), we have P ( T ∈ A ) − P ( Z ∈ A ) P ( T ∈ A γ ) − P ( W ∈ A γ ) + P ( W ∈ A γ ) − P ( Z ∈ A γ ) + P ( Z ∈ A γ \ A ) P ( W ∈ ( A γ ) ∆+4 γ \ ( A γ ) γ ) + 121 d / γ d / γ + 2 E (cid:8) k W k ∆ (cid:9) + 2 n X i =1 E (cid:8) k X i k| ∆ − ∆ ( i ) | (cid:9) . (4.4)This proves the upper bound of P ( T ∈ A ) − P ( Z ∈ A ) . For the upper boundof P ( Z ∈ A ) − P ( T ∈ A ) , we introduce the following notation. Recall that ¯ A isthe closure of A and r := r ( ¯ A ) is the inradius of ¯ A . We consider the followingtwo cases.If r < γ , then A − γ = ∅ . By (4.3), P ( Z ∈ A ) − P ( T ∈ A ) P ( Z ∈ A \ A − γ ) d / γ. (4.5)Now we consider the case where r > γ . Let A = A − γ and it follows that A = ∅ and r ( ¯ A ) = r − γ . Let ∆ = ∆ ∧ ( r − γ ) = ∆ ∧ ( r ( ¯ A ) − γ ) . Since A γ = ( A − γ ) γ ⊂ A , we have P (cid:0) Z ∈ A (cid:1) − P (cid:0) T ∈ A (cid:1) P (cid:0) Z ∈ A (cid:1) − P (cid:0) T ∈ A γ (cid:1) = Q + Q + Q where Q = P (cid:0) Z ∈ A (cid:1) − P (cid:0) Z ∈ A γ (cid:1) ,Q = P (cid:0) Z ∈ A γ (cid:1) − P (cid:0) W ∈ A γ (cid:1) ,Q = P (cid:0) W ∈ A γ (cid:1) − P (cid:0) T ∈ A γ (cid:1) . For Q , by (4.3), we have | Q | P (cid:0) Z ∈ A \ A (cid:1) d / γ. For Q , noting that A γ is also convex, by (4.2), we have | Q | d / γ. We now move to give an upper bound of Q . If ∆ r − γ , { w ∈ A γ } − { w + D ∈ A γ } { w ∈ A γ \ A γ − ∆0 } . (4.6)If ∆ > r − γ , then { w ∈ A γ } − { w + D ∈ A γ } { w ∈ A γ } { w ∈ A γ \ A γ − r } + { w ∈ A γ − r } { w ∈ A γ \ A γ − ( r − γ )0 } + { w ∈ A γ − r } , (4.7) .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics where the last line follows from the fact that ( A − γ ) γ − r ⊂ A γ − r . Equations(4.6) and (4.7) yield Q = P (cid:0) W ∈ A γ (cid:1) − P (cid:0) W + D ∈ A γ (cid:1) P (cid:0) W ∈ A γ \ A γ − ∆ (cid:1) + P (cid:0) W ∈ A γ − r (cid:1) . (4.8)For each i n , let ∆ ( i )0 = ∆ ( i ) ∧ ( r ( ¯ A ) − γ ) . For the first term of (4.8), byProposition 4.1, we have P (cid:0) W ∈ A γ \ A γ − ∆ (cid:1) d / γ + 2 E {k W k ∆ } + 2 n X i =1 E {k ξ i k| ∆ − ∆ ( i )0 |} d / γ + 2 E {k W k ∆ } + 2 n X i =1 E {k ξ i k| ∆ − ∆ ( i ) |} . For the second term of (4.8), since A − r − γ = ∅ and A γ − r is convex andnonempty, by (4.2) and (4.3), we have P ( W ∈ A γ − r ) | P ( W ∈ A γ − r ) − P ( Z ∈ A γ − r ) | + P ( Z ∈ A γ − r \ A − γ − r ) d / γ + 6 d / γ d / γ. Then it follows that Q d / γ + 2 E {k W k ∆ } + 2 n X i =1 E {k ξ i k| ∆ − ∆ ( i ) |} . Combining the upper bounds of Q , Q and Q , we have P ( Z ∈ A ) − P ( T ∈ A ) d / γ + 2 E (cid:8) k W k ∆ (cid:9) + 2 n X i =1 E (cid:8) k X i k| ∆ − ∆ ( i ) | (cid:9) . (4.9)By (4.4), (4.5) and (4.9), we have sup A ∈A | P ( T ∈ A ) − P ( W ∈ A ) | d / γ + 2 E {k W k ∆ } + 2 n X i =1 E {k X i k| ∆ − ∆ ( i ) |} , as desired. Proof of Corollary . Let e T = W + D ( O ) . For any A ∈ A , | P ( T ∈ A ) − P ( e T ∈ A ) | P ( O c ) . 
Applying Theorem 2.1 to e T yields sup A ∈A (cid:12)(cid:12) P ( e T ∈ A ) − P ( Z ∈ A ) (cid:12)(cid:12) d / γ + 2 E (cid:8) k W k ∆ (cid:9) + 2 n X i =1 E (cid:8) k ξ i k| ∆ − ∆ ( i ) | (cid:9) . Combining the foregoing inequalities we obtain the desired result. .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics Proof of Corollary . For any convex set A ⊂ R d , we have Σ − / A := { y ∈ R d : y = Σ − / x, x ∈ A } is also a convex subset of R d . To see this, it suffices toshow that for any y , y ∈ Σ − / A and for any t , ty + (1 − t ) y ∈ Σ − / A. (4.10)Since y , y ∈ Σ − / A , it follows that there exist x , x ∈ A such that y = Σ − / x , y = Σ − / x . Moreover, as A is convex, we have for any t , tx + (1 − t ) x ∈ A, and thus ty + (1 − t ) y = t Σ − / x + (1 − t )Σ − / x = Σ − / ( tx + (1 − t ) x ) ∈ Σ − / A. This proves (4.10) and hence Σ − / A is convex. Note that P ( T ∈ A ) − P (Σ / Z ∈ A ) = P (Σ − / T ∈ Σ − / A ) − P ( Z ∈ Σ − / A ) , and we have sup A ∈A (cid:12)(cid:12) P ( T ∈ A ) − P (Σ − / A ∈ A ) (cid:12)(cid:12) = sup A ∈A (cid:12)(cid:12) P (Σ − / T ∈ A ) − P ( Z ∈ A ) (cid:12)(cid:12) . Applying Theorem 2.1 yields the desired result.
We apply the ideas in Chen and Shao [9] and Chen and Fang [10] to proveProposition 4.1 in this subsection. Before the proof, we first introduce somedefinitions and lemmas.Given A ∈ A and ε > , we construct f A,ε : R d → R d as follows. Let P A bethe projection operator on A , that is, for any x ∈ R d , let P A ( x ) := arg min y ∈ A k x − y k . Therefore, P A ( x ) is the nearest point of x in the set A .Let ¯ A be the closure of A , and f A,ε ( x ) = , x ∈ ¯ A,x − P ¯ A ( x ) , x ∈ A ε \ ¯ A, P ( ¯ A ) ε ( x ) − P ¯ A ( x ) , x ∈ R d \ A ε . (4.11)Let r ( ¯ A ) = max { y : B ( x, y ) ⊂ ¯ A for some x ∈ R d } be the inradius of ¯ A . Weintroduce the following lemma, whose proof can be found in Chen and Fang [10,Lemmas 2.1, 2.2 and Proposition 2.7]. .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics Lemma 4.2.
Let ε > and γ > and f := f A,ε +8 γ be as in (4.11) . We have (i) k f k ε + 8 γ ; (ii) for all ξ, η ∈ R d , h ξ, f ( η + ξ ) − f ( η ) i > ; (iii) for w ∈ A γ + ε \ A γ and k x k γ, we have h x, f ( w ) − f ( w − x ) i >
34 ( x · h ) , where h = ( w − w ) / k w − w k and w = P ¯ A ( w ) . Now we are ready to give the proof of Proposition 4.1.
Proof of Proposition . Let A ∈ A be nonempty such that r := r ( ¯ A ) > γ .Set ¯∆ = ∆ ∧ ( r − γ ) . Let ∆ ( i )1 and ∆ ( i )2 be any random variables that areindependent of ξ i and let ¯∆ ( i )2 = ∆ ( i )2 ∧ ( r − γ ) . For any a > and b r − γ ,define g a,b = f A − b , γ + a + b . Noting that E ξ i = 0 and observing that ∆ ( i )1 and ¯∆ ( i )2 are independent of ξ i , we have E (cid:8) h ξ i , g ∆ ( i )1 , ¯∆ ( i )2 ( W − ξ i ) i (cid:9) = 0 , and thus, E {h W, g ∆ , ¯∆ ( W ) i} = n X i =1 (cid:16) E {h ξ i , g ∆ , ¯∆ ( W ) i} − E (cid:8) h ξ i , g ∆ ( i )1 , ¯∆ ( i )2 ( W − ξ i ) i (cid:9)(cid:17) = H + H , (4.12)where H = n X i =1 E (cid:8)(cid:10) ξ i , g ∆ , ¯∆ ( W ) − g ∆ , ¯∆ ( W − ξ i ) (cid:11)(cid:9) ,H = n X i =1 E (cid:8) h ξ i , g ∆ , ¯∆ ( W − ξ i ) − g ∆ ( i )1 , ¯∆ ( i )2 ( W − ξ i ) i (cid:9) . For the upper bound of H , by the definition of f , we have k g ∆ , ¯∆ ( w ) − g ∆ ( i )1 , ¯∆ ( i )2 ( w ) k (cid:13)(cid:13) g ∆ , ¯∆ ( w ) − g ∆ ( i )1 , ¯∆ ( w ) (cid:13)(cid:13) + (cid:13)(cid:13) g ∆ ( i )1 , ¯∆ ( w ) − g ∆ ( i )1 , ¯∆ ( i )2 ( w ) (cid:13)(cid:13) . (4.13)Without loss of generality, assume that ∆ ( i )1 ∆ . Let A = ( ¯ A ) − ¯∆ , A = A γ +∆ ( i )1 , A = A γ +∆ and w j = P A j ( w ) for j = 2 , , .If w ∈ A ⊂ A , then g ∆ , ¯∆ ( w ) = g ∆ ( i )1 , ¯∆ ( w ); if w ∈ A \ A , then g ∆ , ¯∆ ( w ) = w − w , g ∆ ( i )1 , ¯∆ ( w ) = w − w , .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics and k w − w k = k w − w k , for w ∈ A \ A ; if w ∈ A c , then g ∆ , ¯∆ ( w ) = w − w and g ∆ ( i )1 , ¯∆ ( w ) = w − w . By the definition of w and w , it follows that k w − w k | ∆ − ∆ ( i )1 | . Hence, (cid:13)(cid:13) g ∆ , ¯∆ ( w ) − g ∆ ( i )1 , ¯∆ ( w ) (cid:13)(cid:13) | ∆ − ∆ ( i )1 | . (4.14)Similarly, (cid:13)(cid:13) g ∆ ( i )1 , ¯∆ ( w ) − g ∆ ( i )1 , ¯∆ ( i )2 ( w ) (cid:13)(cid:13) | ¯∆ − ¯∆ ( i )2 | | ∆ − ∆ ( i )2 | . (4.15)By (4.13)–(4.15), H n X i =1 E {k ξ i k ( | ∆ − ∆ ( i )1 | + | ∆ − ∆ ( i )2 | ) } . (4.16)We next estimate the lower bound of H . By Lemma 4.2, we have H = n X i =1 E (cid:8)(cid:10) ξ i , g ∆ , ¯∆ ( W ) − g ∆ , ¯∆ ( W − ξ i ) (cid:11)(cid:9) > n X i =1 E n(cid:10) ξ i , g ∆ , ¯∆ ( W ) − g ∆ , ¯∆ ( W − ξ i ) (cid:11) ( | ξ i | γ ) ( W ∈ A γ +∆ \ A γ − ¯∆ ) o > n X i =1 E n h ξ i , U i ( k ξ i k γ ) ( W ∈ A γ +∆ \ A γ − ¯∆ ) o := 34 R (4.17)where U := ( W − W ) / k W − W k = ( U , . . . , U d ) and W = P ¯ A ( W ) . Observethat by (4.17), R = n X i =1 d X j =1 E (cid:8) ξ ij U j ( k ξ i k γ ) ( W ∈ A γ +∆ \ A γ − ¯∆ ) (cid:9) + n X i =1 X j = j ′ E (cid:8) ξ ij ξ ij ′ U j U j ′ ( k ξ i k γ ) ( W ∈ A γ +∆ \ A γ − ¯∆ ) (cid:9) := R + R . .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics For R , rearranging the summations yields R = d X j =1 E (cid:26) ( W ∈ A γ +∆ \ A γ − ¯∆ ) U j n X i =1 ξ ij ( k ξ i k γ ) (cid:27) = d X j =1 E (cid:26) ( W ∈ A γ +∆ \ A γ − ¯∆ ) U j (cid:18) n X i =1 (cid:16) ξ ij ( k ξ i k γ ) − E ξ ij ( k ξ i k γ ) (cid:17)(cid:19)(cid:27) + d X j =1 (cid:16) E { ( W ∈ A γ +∆ \ A γ − ¯∆ ) U j } (cid:17)(cid:18) n X i =1 E { ξ ij ( k ξ i k γ ) } (cid:19) := R + R . 
By the basic inequality that ab γa + (1 / γ ) b for a, b > , it follows thatwith a = U j and b = (cid:12)(cid:12)(cid:12)(cid:12) n X i =1 (cid:16) ξ ij ( k ξ i k γ ) − E ξ ij ( k ξ i k γ ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) , we have | R | d X j =1 E (cid:26) U j (cid:12)(cid:12)(cid:12)(cid:12) n X i =1 (cid:16) ξ ij ( k ξ i k γ ) − E ξ ij ( k ξ i k γ ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:27) γ d X j =1 E { U j } + 14 γ d X j =1 Var (cid:18) n X i =1 ξ ij ( k ξ i k γ ) (cid:19) γ d X j =1 E { U j } + 14 γ d X j =1 n X i =1 E (cid:8) ξ ij ( k ξ i k γ ) (cid:9) . (4.18)As for R , recalling that P dj =1 U i = 1 and P ni =1 E { ξ i ξ ⊺ i } = I d , we have P ni =1 E { ξ ij } = 1 for each j d , and R = d X j =1 E n ( W ∈ A γ +∆ \ A γ − ¯∆ ) U j o(cid:18) n X i =1 E { ξ ij } − n X i =1 E { ξ ij ( k ξ i k > γ ) } (cid:19) = P ( W ∈ A γ +∆ \ A γ − ¯∆ ) (4.19) − E (cid:26) ( W ∈ A γ +∆ \ A γ − ¯∆ ) d X j =1 U j (cid:18) n X i =1 E { ξ ij ( k ξ i k > γ ) } (cid:19)(cid:27) . .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics By (4.18) and (4.19), it follows that R > P ( W ∈ A γ +∆ \ A γ − ¯∆ ) − γ d X j =1 E { U j } − γ n X i =1 d X j =1 E (cid:8) ξ ij ( k ξ i k γ ) (cid:9) − E (cid:26) ( W ∈ A γ +∆ \ A γ − ¯∆ ) d X j =1 U j (cid:18) n X i =1 E { ξ ij ( k ξ i k > γ ) } (cid:19)(cid:27) . (4.20)Similarly, noting that P ni =1 E { ξ ij ξ ij ′ } = 0 for j = j ′ , we have R > − γ X j = j ′ E (cid:8) U j U j ′ (cid:9) − γ n X i =1 X j = j ′ d E (cid:8) ( ξ ij ξ ij ′ ) ( k ξ i k γ ) (cid:9) − n X i =1 X j = j ′ d E n ( W ∈ A γ +∆ \ A γ − ¯∆ ) U j U j ′ E { ξ ij ξ ij ′ ( k ξ i k γ ) } o = − γ X j = j ′ E (cid:8) U j U j ′ (cid:9) − γ n X i =1 X j = j ′ d E (cid:8) ( ξ ij ξ ij ′ ) ( k ξ i k γ ) (cid:9) − n X i =1 X j = j ′ d E n ( W ∈ A γ +∆ \ A γ − ¯∆ ) U j U j ′ E { ξ ij ξ ij ′ } o − n X i =1 X j = j ′ d E n ( W ∈ A γ +∆ \ A γ − ¯∆ ) U j U j ′ E { ξ ij ξ ij ′ ( k ξ i k > γ ) } o = − γ X j = j ′ E (cid:8) U j U j ′ (cid:9) − γ n X i =1 X j = j ′ d E (cid:8) ( ξ ij ξ ij ′ ) ( k ξ i k γ ) (cid:9) − n X i =1 X j = j ′ d E n ( W ∈ A γ +∆ \ A γ − ¯∆ ) U j U j ′ E { ξ ij ξ ij ′ ( k ξ i k > γ ) } o . (4.21)Observe that n X i =1 E {k ξ i k ( k ξ i k γ ) } γ n X i =1 E k ξ i k γ (4.22)and by the Markov inequality, n X i =1 E {k ξ i k ( k ξ i k > γ ) } γ n X i =1 E k ξ i k = 14 . (4.23) .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics Recall that k U k = 1 and γ = P ni =1 E {k ξ i k } , and thus (4.20)–(4.23) yield R > P ( W ∈ A γ +∆ \ A γ − ¯∆ ) − γ E {k U k } − γ n X i =1 E {k ξ i k ( k ξ i k γ ) }− n X i =1 E n ( W ∈ A γ +∆ \ A γ − ¯∆ ) E n(cid:16) d X j =1 U j ξ ij (cid:17) ( k ξ i k > γ ) oo > P ( W ∈ A γ +∆ \ A γ − ¯∆ ) − γ E {k U k } − γ n X i =1 E {k ξ i k ( k ξ i k γ ) }− n X i =1 E n ( W ∈ A γ +∆ \ A γ − ¯∆ ) E {k U k k ξ i k ( k ξ i k > γ ) } o = P ( W ∈ A γ +∆ \ A γ − ¯∆ ) − γ E {k U k } − γ n X i =1 E {k ξ i k ( k ξ i k γ ) }− n X i =1 E n ( W ∈ A γ +∆ \ A γ − ¯∆ ) E {k ξ i k ( k ξ i k > γ ) } o > P ( W ∈ A γ +∆ \ A γ − ¯∆ ) − γ − P ( W ∈ A γ +∆ \ A γ − ¯∆ )= 34 P ( W ∈ A γ +∆ \ A γ − ¯∆ ) − γ. (4.24)By (4.17) and (4.24), we have H > P ( W ∈ A γ +∆ \ A γ − ¯∆ ) − γ. (4.25)On the other hand, note that E k W k = d and by Lemma 4.2, k g ∆ , ¯∆ ( W ) k (∆ + ∆ + 8 γ ) . Thus, | E (cid:10) W, g ∆ , ¯∆ ( W ) (cid:11) | E k W k (∆ + ∆ ) + 8 d / γ. 
(4.26)Combining inequalities (4.12), (4.16), (4.25) and (4.26) yields P ( W ∈ A γ +∆ \ A γ − ¯∆ ) (cid:12)(cid:12)(cid:12) E (cid:10) W, g ∆ , ∆ ( W ) (cid:11)(cid:12)(cid:12)(cid:12) + 2 H + 3 γ E (cid:8) k W k (∆ + ∆ ) (cid:9) + 16 d / γ + 3 γ + 2 n X i =1 2 X j =1 E k ξ i k| ∆ j − ∆ ( i ) j | E k W k (∆ + ∆ ) + 19 d / γ + 2 n X i =1 2 X j =1 E k ξ i k (cid:12)(cid:12) ∆ j − ∆ ( i ) j (cid:12)(cid:12) . This proves (4.1). .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics
5. Proofs of other results
In this section, we give the proofs of the theorems in Section 3.
Note that ˆ θ n minimizes M n ( θ ) , and m θ is smooth for θ . By the Taylor expansion,it follows that n n X i =1 ˙ m ˆ θ n ( X i ) = 1 n n X i =1 ˙ m θ ∗ ( X i ) + 1 n n X i =1 Z (cid:0) ¨ m θ t ( X i ) (cid:1)(cid:0) ˆ θ n − θ ∗ (cid:1) dt, where θ t = θ ∗ + t (ˆ θ n − θ ∗ ) . Therefore, recalling that V = E { ¨ m θ ∗ ( X ) } and ξ i = ˙ m θ ∗ ( X i ) , V (cid:0) ˆ θ n − θ ∗ (cid:1) = − n n X i =1 ξ i − (cid:16) n n X i =1 ¨ m θ ∗ ( X i ) − V (cid:17)(cid:0) ˆ θ n − θ ∗ (cid:1) − n n X i =1 Z (cid:0) ¨ m θ t ( X i ) − ¨ m θ ∗ ( X i ) (cid:1)(cid:0) ˆ θ n − θ ∗ (cid:1) dt, Let W = − √ n n X i =1 Σ − / ξ i , and D = −√ n Σ − / (cid:16) n n X i =1 ¨ m θ ∗ ( X i ) − V (cid:17)(cid:0) ˆ θ n − θ ∗ (cid:1) − √ n Σ − / (cid:16) n n X i =1 Z (cid:0) ¨ m θ ( t ) ( X i ) − ¨ m θ ∗ ( X i ) (cid:1) dt (cid:17)(cid:0) ˆ θ n − θ ∗ (cid:1) . Then, we have T := √ n Σ − / V (ˆ θ n − θ ∗ ) = W + D. (5.1)By (M1) and (M2), we have k D k n / λ − / (cid:0) H k ˆ θ n − θ ∗ k + H k ˆ θ n − θ ∗ k (cid:1) , (5.2)where H = (cid:13)(cid:13)(cid:13) n n X i =1 (cid:0) ¨ m θ ∗ ( X i ) − E { ¨ m θ ∗ ( X i ) } (cid:1)(cid:13)(cid:13)(cid:13) , H = 1 n n X i =1 k m ( X i ) k . .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics Let ∆ = n / λ − / (cid:0) H k ˆ θ n − θ ∗ k + H k ˆ θ n − θ ∗ k (cid:1) , and it follows that k T − W k ∆ . Let ( X ′ , . . . , X ′ n ) be an independent copy of ( X , . . . , X n ) , and define X ( i ) j = ( X j , j = i ; X ′ i , j = i. Moreover, let H ( i )1 = (cid:13)(cid:13)(cid:13) n n X j =1 (cid:0) ¨ m θ ∗ ( X ( i ) j ) − E { ¨ m θ ∗ ( X j ) } (cid:1)(cid:13)(cid:13)(cid:13) , H ( i )2 = 1 n n X j =1 k m ( X ( i ) j ) k ˆ θ ( i ) n = arg min θ ∈ Θ n n X j =1 m θ ( X ( i ) j ) , ∆ ( i ) = n / λ − / (cid:0) H ( i )1 k ˆ θ ( i ) n − θ ∗ k + H ( i )2 k ˆ θ ( i ) n − θ ∗ k (cid:1) . Then, ∆ ( i ) is independent of X i and ξ i . Theorem 3.1 follows directly fromTheorem 2.1 and the following proposition. Proposition 5.1.
Assume that conditions (M1) and (M2) are satisfied. Then,we have n X i =1 E k Σ − / ξ i k Cd / n, (5.3) E {k W k ∆ } Cd / n − / , (5.4) n X i =1 E k Σ − / ξ i k| ∆ − ∆ ( i ) | Cd / , (5.5) where C > is a constant depending only on λ , λ , c , c , c , c and µ . In order to prove Proposition 5.1, we first need to prove three useful lemmas,whose proofs are postponed to Appendix A.The following lemma provides an upper bound for the p -th moment of k ˆ θ n − θ ∗ k , whose proof (see Appendix A.2) is based on the ideas in Van der Vaart andWellner [32, Theorem 3.2.5]. Lemma 5.2.
Assume that there exist p > and a > such that (3.6) and (3.7) are satisfied with k m ( X ) k p +1 a . Then, we have E k ˆ θ n − θ ∗ k p Cµ − p − a p +16 d p +12 n − p , where C > is a constant depending only on p . The next lemma gives upper bounds for the fourth moments of H and H .The proof can be found in Appendix A.3, where we use the Rosenthal-typeinequality for random matrices (see, e.g, Chen et al. [12, Theorem A.1]). Lemma 5.3.
Under the assumption (M1) , we have E (cid:8) H (cid:9) Cc n − , (5.6) E (cid:8) H (cid:9) Cc , (5.7) where C > is an absolute constant. .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics The following lemma gives an upper bound of E k ˆ θ n − ˆ θ ( i ) n k , whose proof isgiven in Appendix A.4. Lemma 5.4.
Under the assumptions (M1) and (M2) , we have E k ˆ θ n − ˆ θ ( i ) n k Cλ − n − (cid:16) c d + µ − / c / c d / + µ − / c / c d / (cid:17) (5.8) where C > is an absolute constant. With Lemmas 5.2–5.4, we are ready to give the proof of Proposition 5.1.
Proof of Proposition . In this proof, we denote by C a general positive con-stant depending on λ , λ , c ,c , c , c and µ , and the value of C might be different in different places. Thebound (5.3) follows from (M2). For (5.4), since ∆ = n / λ − / ( H k ˆ θ n − θ ∗ k + H k ˆ θ n − θ ∗ k ) , it follows from Lemmas 5.2 and 5.3 and the Hölder inequality that E (cid:8) ∆ k W k (cid:9) Cn / λ − / (cid:16) E (cid:8) H k W kk ˆ θ n − θ ∗ k + H k W kk ˆ θ n − θ ∗ k (cid:9)(cid:17) Cn / λ − / (cid:16) k H k k W k k ˆ θ n − θ ∗ k + k H k k W k k ˆ θ n − θ ∗ k (cid:17) Cd n − / , (5.9)and E ∆ Cd / n − / . (5.10)Therefore (5.9) and (5.10) yield (5.4).For (5.5), we have | ∆ − ∆ ( i ) | n / λ − / (cid:16) H k ˆ θ n − ˆ θ ( i ) n k + | H − H ( i )1 |k ˆ θ n − θ ( i ) n k + | H − H ( i )2 |k ˆ θ ( i ) n − θ ∗ k + H ( k ˆ θ n − θ ∗ k + k ˆ θ ( i ) n − θ ∗ k ) k ˆ θ n − ˆ θ ( i ) n k (cid:17) . By the assumption (M1), k H − H ( i )1 k n k ¨ m θ ∗ ( X i ) − ¨ m θ ∗ ( X ( i ) i ) k Cn − and k H − H ( i )2 k n (cid:16) k m ( X i ) k + k m ( X ( i ) i ) k (cid:17) Cn − . By Lemmas 5.2–5.4 and the Hölder inequality, we have E {k Σ − / ξ i k| ∆ − ∆ ( i ) |} Cd / n − , which yields (5.5). .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics Let δ n = ( D Θ + 1) dn − ( p − / (2 p − , where D Θ is the diameter of the parameterspace Θ . As p > , it follows that δ n > n − / . In this subsection, we denote by C > a constant depending only on p, c , c , c , λ , λ and µ , which might bedifferent in different places. The main idea is to rewrite √ n Σ − / ˙Ψ (ˆ θ n − θ ∗ ) as asummation of a linear statistic plus an error term, and then apply Corollary 2.2to prove (5.16). To this end, by (3.11), √ n (cid:0) Ψ(ˆ θ n ) − Ψ( θ ∗ ) (cid:1) = √ n (cid:16) Ψ(ˆ θ n ) − Ψ n (ˆ θ n ) (cid:17) = −√ n (cid:0) Ψ n ( θ ∗ ) − Ψ( θ ∗ ) (cid:1) − (cid:16) √ n (Ψ n (ˆ θ n ) − Ψ(ˆ θ n )) − √ n (Ψ n ( θ ∗ ) − Ψ( θ ∗ )) (cid:17) . (5.11)By (5.11), we obtain T := √ n Σ − / ˙Ψ (ˆ θ n − θ ∗ ) = W + D, where W = − √ n n X i =1 Σ − / ξ i ,D = − Σ − / (cid:16) √ n (Ψ n (ˆ θ n ) − Ψ(ˆ θ n )) − √ n (Ψ n ( θ ∗ ) − Ψ( θ ∗ )) (cid:17) − √ n Σ − / (cid:16) Ψ(ˆ θ n ) − Ψ( θ ∗ ) − ˙Ψ (ˆ θ n − θ ∗ ) (cid:17) . By (3.13) and (3.16), k D k ( k ˆ θ n − θ ∗ k δ n ) ∆ + ∆ , where ∆ = λ − / √ n sup θ : k θ − θ ∗ k δ n (cid:13)(cid:13)(cid:13)(cid:0) Ψ n ( θ ) − Ψ( θ ) (cid:1) − (cid:0) Ψ n ( θ ∗ ) − Ψ( θ ∗ ) (cid:1)(cid:13)(cid:13)(cid:13) ∆ = c λ − / √ n k ˆ θ n − θ ∗ k (cid:0) k ˆ θ n − θ ∗ k δ n (cid:1) . Now we construct random variables ∆ ( i )1 and ∆ ( i )2 that are independent of ξ i .Let ( X ′ ,. . . , X ′ n ) be an independent copy of ( X , . . . , X n ) and let Ψ ( i ) n ( θ ) = Ψ n ( θ ) − n (cid:0) h θ ( X i ) − h θ ( X ′ i ) (cid:1) , i n. Let ˆ θ ( i ) n be the minimizer of Ψ ( i ) n , and let ∆ ( i )1 = λ − / √ n sup θ : k θ − θ ∗ k δ n (cid:13)(cid:13)(cid:13)(cid:0) Ψ ( i ) n ( θ ) − Ψ( θ ) (cid:1) − (cid:0) Ψ ( i ) n ( θ ∗ ) − Ψ( θ ∗ ) (cid:1)(cid:13)(cid:13)(cid:13) ∆ ( i )2 = c λ − / √ n k ˆ θ ( i ) n − θ ∗ k (cid:0) k ˆ θ ( i ) n − θ ∗ k δ n (cid:1) . To apply Corollary 2.2, we need to develop the following proposition. .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics Proposition 5.5.
Let B δ = { θ ∈ Θ : k θ − θ ∗ k δ } . Under the conditions (B1)–(B3) , P (cid:0) ˆ θ n ∈ B cδ n (cid:1) C ( D Θ + 1) p d p δ − pn n − p/ , (5.12) n X i =1 E k Σ − / ξ i k Cd / n, (5.13) E (cid:8) k W k (∆ + ∆ ) (cid:9) Cd / δ n + C ( D Θ + 1) d / n − / , (5.14) n X i =1 2 X j =1 E (cid:8) k ξ i k (cid:12)(cid:12) ∆ j − ∆ ( i ) j (cid:12)(cid:12)(cid:9) C (cid:16) ( D Θ + 1) p d p +1 / δ − p +2 n n − p − (5.15) + ( D Θ + 1) d + ( D Θ + 1) d / n / δ n (cid:17) . By Corollary 2.2 with O = {k ˆ θ n − θ ∗ k δ n } and Proposition 5.5, sup A ∈A (cid:12)(cid:12) P (cid:0) √ n Σ − / ˙Ψ (ˆ θ n − θ ∗ ) ∈ A (cid:1) − P (cid:0) Z ∈ A (cid:1)(cid:12)(cid:12) Cd / n − / n X i =1 E k Σ − / ξ i k + C E {k W k (∆ + ∆ ) } + Cn − / n X i =1 E (cid:8) k ξ i k (∆ + ∆ − ∆ ( i )1 − ∆ ( i )2 ) (cid:9) + P (cid:0) k ˆ θ n − θ ∗ k > δ n (cid:1) Cn − / (cid:0) d + ( D Θ + 1) d / (cid:1) + C ( D Θ + 1) d / δ n + C ( D Θ + 1) p d p +1 / δ − p +2 n n − ( p − / + C ( D Θ + 1) p d p δ − pn n − p/ . (5.16)Recall that p > , and then δ n n > . Therefore,RHS of (5.16) Cn − / ( D Θ + 1) d / + C ( D Θ + 1) d / δ n + C ( D Θ + 1) p d p +1 / δ − p +2 n n − ( p − / C ( D Θ + 1) (cid:16) d / n − / + d / n − p − p − (cid:17) C ( D Θ + 1) d / n − / ε p , where ε p = 1 / (2 p − . This proves Theorem 3.2.It suffices to prove Proposition 5.5, and we need to apply some preliminarylemmas, whose proofs are put in Appendix A. Lemma 5.6.
Let B δ be as in Proposition . Under the assumptions (B1)–(B3) , we have k W k λ − / c d / , k W k Cλ − / c d / , (5.17) E ∆ Cλ − c d δ n , (5.18) .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics where C > is an absolute constant. Moreover, E (cid:8) k ˆ θ n − θ ∗ k p (cid:9) C ( D Θ + 1) p d p n − p/ , (5.19) E n k ˆ θ n − ˆ θ ( i ) n k p (cid:0) ˆ θ n ∈ B δ , ˆ θ ( i ) n ∈ B δ (cid:1)o C (cid:0) d p/ n − p + d p n − p/ δ p (cid:1) , (5.20) where C > is a constant depending only on c , c , µ and p . Lemma 5.7.
We have E (cid:8) k ξ i kk ˆ θ n − θ ∗ k (cid:0) ˆ θ n ∈ B δ , ˆ θ ( i ) n ∈ B cδ (cid:1)(cid:9) C ( D Θ + 1) p d p +1 / δ − p +2 n − p/ , (5.21) E (cid:8) k ξ i kk ˆ θ ( i ) n − θ ∗ k (cid:0) ˆ θ n ∈ B cδ , ˆ θ ( i ) n ∈ B δ (cid:1)(cid:9) C ( D Θ + 1) p d p +1 / δ − p +2 n − p/ , (5.22) and E n k ξ i k ( k ˆ θ n − θ ∗ k + k ˆ θ ( i ) n − θ ∗ k ) k ˆ θ n − ˆ θ ( i ) n k (cid:0) ˆ θ n ∈ B δ , ˆ θ ( i ) n ∈ B δ (cid:1)o C ( D Θ + 1) (cid:16) d n − / + d / n − δ (cid:17) . (5.23)Now we are ready to give the proof of Proposition 5.5. Proof of Proposition . The inequality (5.12) follows directly from (5.19) andthe Chebyshev inequality.For (5.13), by (B3), we have E k Σ − / ξ i k λ − / c d / , and thus (5.13) holds.For (5.14), it suffices to give the bounds for the moments of k W k , ∆ and ∆ . By (5.17) and (5.18) and the Cauchy inequality, it follows that E {k W k ∆ } Cd / δ n . (5.24)Recall that p > , ∆ c λ − / n / k ˆ θ n − θ k and by (5.19), E k ˆ θ n − θ ∗ k C ( D Θ + 1) d n − / , (5.25)and then by (5.17), (5.25) and the Hölder inequality, E {k W k ∆ } Cn / (cid:8) k W k k ˆ θ n − θ ∗ k (cid:9) C ( D Θ + 1) d / n − / . (5.26)Combining (5.24) and (5.26) yields (5.14).It suffices to prove (5.15). By the definition of ∆ and ∆ ( i )1 , (cid:12)(cid:12) ∆ − ∆ ( i )1 (cid:12)(cid:12) λ − / n − / sup θ : k θ − θ ∗ k δ n (cid:13)(cid:13) h θ ( X i ) − h θ ( X ′ i ) (cid:13)(cid:13) d / λ − / n − / δ n | h ( X ) | , .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics and by (3.15), E (cid:12)(cid:12) ∆ − ∆ ( i )1 (cid:12)(cid:12) Cdn − δ n . Thus, by (3.17) and the Cauchy inequality, n X i =1 E n k ξ i k (cid:0) | ∆ − ∆ ( i )1 | (cid:1)o Cdn / δ n . (5.27)For ∆ − ∆ ( i )2 , we have (cid:12)(cid:12) ∆ − ∆ ( i )2 (cid:12)(cid:12) c λ − / √ n (cid:16) k ˆ θ n − θ ∗ k (cid:0) k ˆ θ n − θ ∗ k δ n , k ˆ θ ( i ) n − θ ∗ k > δ n (cid:1) + k ˆ θ ( i ) n − θ ∗ k (cid:0) k ˆ θ n − θ ∗ k > δ n , k ˆ θ ( i ) n − θ ∗ k δ n (cid:1) + ( k ˆ θ n − θ ∗ k + k ˆ θ ( i ) n − θ ∗ k ) k ˆ θ n − ˆ θ ( i ) n k (cid:0) k ˆ θ n − θ ∗ k δ n , k ˆ θ ( i ) n − θ ∗ k δ n (cid:1)(cid:17) . By Lemmas 5.6 and 5.7, it follows that n X i =1 E n k ξ i k (cid:0)(cid:12)(cid:12) ∆ − ∆ ( i )2 (cid:12)(cid:12)(cid:1)o C ( D Θ + 1) p d p +1 / δ − p +2 n − ( p − / + C ( D Θ + 1) d + C ( D Θ + 1) d / n / δ n . (5.28)Together with (5.27) and (5.28), we obtain (5.15). Theorem 3.3 follows from the proof of Theorem 3.2 and the following proposi-tion.
Proposition 5.8.
Under the conditions (B1) , (B4) and (B5) , we have P (cid:0) ˆ θ n ∈ B cδ (cid:1) C exp (cid:16) − C ′ √ nδ ( D Θ + 1) d / (cid:17) , n X i =1 E k Σ − / ξ i k Cd / n E (cid:8) k W k (∆ + ∆ ) (cid:9) Cd / δ n + C ( D Θ + 1) d / n − / , n X i =1 E (cid:8) k ξ i k (cid:12)(cid:12) ∆ + ∆ − ∆ ( i )1 − ∆ ( i )2 (cid:12)(cid:12)(cid:9) C ( D Θ + 1) d / exp (cid:16) − C ′ √ nδ n D Θ + 1) d / (cid:17) + C ( D Θ + 1) d + C ( D Θ + 1) d / n / δ n , where C ′ > is a constant depending only on c , c and µ and C > is aconstant depending only on c , c , c , µ, λ and λ . .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics Similar to the proof of Theorem 3.2, and by Proposition 5.8, we have sup A ∈A (cid:12)(cid:12) P (cid:0) √ n Σ − / ˙Ψ (ˆ θ n − θ ∗ ) ∈ A (cid:1) − P (cid:0) Z ∈ A (cid:1)(cid:12)(cid:12) C ( D Θ + 1) d / n − / + C ( D Θ + 1) d / δ n + C ( D Θ + 1) d / exp (cid:16) − C ′ √ nδ n D Θ + 1) d / (cid:17) . Choosing δ n = ( C ′ ) − ( D Θ + 1) d / n − / log n , we completes the proof of Theorem 3.3.It suffices to prove Proposition 5.8.The following lemma is a modification of Lemmas 5.6 and 5.7, whose proofis given in Appendix A. Lemma 5.9.
Let B δ be as in Proposition . Under the assumptions (B1) , (B4) and (B5) , we have (5.19) and (5.20) hold for each p > with a positiveconstant C depending on c , c , c , µ, λ , λ and p . Moreover, we have there existsa constant C ′ > depending only on c , c and µ such that P (cid:0) k ˆ θ n − θ ∗ k > t (cid:1) (cid:16) − C ′ n / t ( D Θ + 1) d / (cid:17) , for t > , (5.29) and E n k ξ i kk ˆ θ n − θ ∗ k (cid:0) ˆ θ n ∈ B δ , ˆ θ ( i ) n ∈ B cδ (cid:1)o C ( D Θ + 1) d / n − exp (cid:16) − C ′ n / δ D Θ + 1) d / (cid:17) , (5.30) E n k ξ i kk ˆ θ ( i ) n − θ ∗ k (cid:0) ˆ θ n ∈ B cδ , ˆ θ ( i ) n ∈ B δ (cid:1)o C ( D Θ + 1) d / n − exp (cid:16) − C ′ n / δ D Θ + 1) d / (cid:17) , (5.31) where C > depending only on c , c , c , µ, λ and λ .Proof of Proposition . The first inequality follows directly from (5.29), andthe last three inequalities follow from (5.30) and (5.31) and from the proof ofProposition 5.5.
Without loss of generality, we assume that n > { (2 Lℓ ) α + 1 } ; otherwise, thebound (3.26) is trivial.In this subsection, we denote by C, C , C , . . . a sequence of general positiveconstants depending only on ℓ , λ , λ , c , c , α, β, L and µ and independent of τ and τ . Let L := max { c , L/β } and L := c + L. We introduce the followingfamily of functions: Let ϕ β : R + \ { } 7→ R be given by ϕ β ( t ) = ( t β − β , if β = 0 , log t, if β = 0 . (5.32) .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics Proof of Theorem . Note that θ ∗ is the minimum point of f and by the dif-ferentiability and convexity of f , we have ∇ f ( θ ∗ ) = 0 . By (3.25), ∇ f ( θ ) = ∇ f ( θ ∗ )( θ − θ ∗ ) + H ( θ ) , (5.33)where H ( θ ) = ∇ f ( θ ) − ∇ f ( θ ∗ )( θ − θ ∗ )= ∇ f ( θ ) − ∇ f ( θ ∗ ) − ∇ f ( θ ∗ )( θ − θ ∗ )= Z {∇ f ( θ ∗ + t ( θ − θ ∗ )) − ∇ f ( θ ∗ ) } ( θ − θ ∗ ) dt. By (C2) and (C3), it follows that k H ( θ ) k ( k θ − θ ∗ k β ) c k θ − θ ∗ k , and k H ( θ ) k ( k θ − θ ∗ k > β ) L k θ − θ ∗ k ( k θ − θ ∗ k > β ) Lβ k θ − θ ∗ k . Hence, with L := max { c , L/β } , we have k H ( θ ) k L k θ − θ ∗ k . (5.34)Recall that G := ∇ f ( θ ∗ ) , and it follows from (3.22) and (5.33) that for any n > , θ n = θ n − − ℓ n (cid:0) ∇ f ( θ n − ) + ζ n (cid:1) = θ n − − ℓ n (cid:0) G ( θ n − − θ ∗ ) + ξ n + η n + H ( θ n − ) (cid:1) . (5.35)By definition, (¯ θ n − θ ∗ ) = n − P n − i =0 ( θ i − θ ∗ ) . Solving the recursive system (5.35)yields √ n (¯ θ n − θ ∗ ) = 1 √ nℓ Q ( θ − θ ∗ ) + 1 √ n n − X i =1 Q i (cid:0) ξ i + η i + H ( θ i − ) (cid:1) , where Q i = ℓ i P n − j = i Q jk = i +1 ( I − ℓ k G ) . Recall that Σ n := n − P n − i =1 Q i Σ i Q ⊺ i . Let T n = n − / Σ − / n n − X i =0 ( θ n − θ ∗ ) , ζ i = 1 √ n Σ − / n Q i ξ i , W n = n − X i =1 ζ i and D n = 1 √ nℓ Σ − / n Q ( θ − θ ∗ ) + 1 √ n Σ − / n n − X i =1 Q i η i + 1 √ n Σ − / n n − X i =1 Q i H ( θ i − ):= D ,n + D ,n + D ,n . .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics It is easy to show that E W n = 0 , Var( W n ) = I d . and T n = W n + D n . Also, k D n k n − / ℓ − k Σ − / n k · k Q k · k θ − θ ∗ k + n − / k Σ − / n k (cid:13)(cid:13)(cid:13)(cid:13) n − X i =1 Q i η i (cid:13)(cid:13)(cid:13)(cid:13) + n − / k Σ − / n k n − X i =1 k Q i H ( θ i − ) k The following proposition provides the bounds of Q j and Σ − n . Proposition 5.10.
Suppose that n > { (2 Lℓ ) α + 1 } . If ℓ i = ℓ i − α with / <α , then there exists a sequence ( p i ) i > , and two positive constants C and C depending on ℓ , λ , λ , c , c , α, β, L and µ such that for each i n − , Σ − n C I d , − p i I d Q i p i I d , (5.36) where p i ( C , if ( α = 1 , ℓ µ > or ( α ∈ (1 / , C log n, if ( α = 1 , ℓ µ = 1) . Let ( ξ ′ , . . . , ξ ′ n ) be an independent copy of ( ξ , . . . , ξ n ) . For each i n − ,we now construct D ( i )2 ,n and D ( i )3 ,n which are independent of ξ i . Firstly, for each i , we construct θ ( i )1 , . . . , θ ( i ) n as follows:(a) If j < i , let θ ( i ) j = θ j .(b) If j = i , let θ ( i ) j = θ ( i ) j − − ℓ j ( ∇ f ( θ ( i ) j − ) + ξ ′ j + η ( i ) j ) , where η ( i ) j = g ( θ ( i ) j − , ξ ′ i ) .(c) If j > i , let θ ( i ) j = θ ( i ) j − − ℓ j ( ∇ f ( θ ( i ) j − ) + ξ j + η ( i ) j ) , where η ( i ) j = g ( θ ( i ) j − , ξ j ) .Secondly, let D ( i )2 ,n = n − / Σ − / n n − X j =1 Q i η ( i ) j ,D ( i )3 ,n = n − / Σ − / n n − X j =1 Q i H ( θ ( i ) j − ) . Then, we have for each i n − , D ( i )2 ,n and D ( i )3 ,n is independent of ξ i . Let ∆ = ∆ + ∆ + ∆ , .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics where ∆ = k D ,n k , ∆ = k D ,n k and ∆ = C L n − / P n − i =1 p i (cid:13)(cid:13) θ i − − θ ∗ (cid:13)(cid:13) , where C is given as in (5.38). By (5.34), it follows that k D ,n k ∆ . Also, foreach i n − , define ∆ ( i )1 = k D ,n k , ∆ ( i )2 = k D ( i )2 ,n k , ∆ ( i )3 = C L n − / n − X j =1 p j (cid:13)(cid:13) θ ( i ) j − − θ ∗ (cid:13)(cid:13) . Clearly, ∆ ( i )1 , ∆ ( i )2 and ∆ ( i )3 are independent of ξ i for each i n − . Thefollowing proposition provides the bounds of the moments for ∆ j and ∆ j − ∆ ( i ) j , j = 1 , , . Proposition 5.11.
We have ∆ is independent of ( ξ , . . . , ξ n ) and E { ∆ k W k} C ( τ + τ ) n − / .
1. For α ∈ (1 / , , E { ∆ k W k} Cd / ( τ + τ ) n − α/ , E { ∆ k W k} Cd / ( τ + τ ) n − α +1 / . and n − X i =1 E {| ∆ − ∆ ( i )2 |k ζ i k} C ( τ + τ ) n − α +1 / , n − X i =1 E {| ∆ − ∆ ( i )3 |k ζ i k} C ( τ + τ ) n − α/ .
2. For α = 1 , E { ∆ k W k} ( Cd / ( τ + τ ) n − / (log n ) / , ℓ µ > Cd / ( τ + τ ) n − / (log n ) , ℓ µ = 1 . E { ∆ k W k} ( Cd / ( τ + τ ) n − / (log n ) , ℓ µ > Cd / ( τ + τ ) n − / (log n ) / , ℓ µ = 1 , n − X i =1 E (cid:8)(cid:12)(cid:12) ∆ − ∆ ( i )2 (cid:12)(cid:12) k ζ i k (cid:9) C ( τ + τ ) × ( n − / , ℓ µ > n − / (log n ) / , ℓ µ = 1 . and n − X i =1 E (cid:8)(cid:12)(cid:12) ∆ − ∆ ( i )3 (cid:12)(cid:12) k ζ i k (cid:9) C ( τ + τ ) × ( n − / , µℓ > n − / (log n ) / , µℓ = 1 . .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics We apply Theorem 2.1 to prove the Berry–Esseen bound for √ n Σ − / n (¯ θ n − θ ∗ ) . (1). For / < α < . Firstly, by Proposition 5.10 and (C1), we have n − X i =1 E k ζ i k Cn − / n − X i =1 E k ξ i k Cn − / τ . (5.37)By Proposition 5.11, we have E {k W k ∆ } C ( d / + τ + τ ) n − α +1 / , n − X i =1 E {k ζ i k · | ∆ − ∆ ( i ) |} C ( d / + τ + τ ) n − α/ . (5.38)Substituting (5.37) and (5.38) to Theorem 2.1 yields (3.26). (2). For α = 1 . By the definition of ζ i and by (5.36), n − X i =1 E k ζ i k Cn − / n − X i =1 p i E k ξ i k ( Cτ n − / , if ℓ µ > ,Cτ n − / (log n ) , if ℓ µ = 1 . (5.39)By Proposition 5.11, we have E { ∆ k W k} C ( d + τ + τ ) n − / + C ( d / + τ + τ ) × ( n − / (log n ) , ℓ µ > n − / (log n ) / , ℓ µ = 1 . (5.40)Define ∆ ( i ) = ∆ + ∆ ( i )2 + ∆ ( i )3 , then we have ∆ ( i ) is independent of ζ i . Also, ∆ − ∆ ( i ) = ∆ − ∆ ( i )2 + ∆ − ∆ ( i )3 . By Proposition 5.11, we have n − X i =1 E {| ∆ − ∆ ( i ) |k ζ i k} C ( d / + τ + τ ) × ( n − / , ℓ µ > n − / (log n ) / , ℓ µ = 1 . (5.41)Then the bound (3.27) follows from Theorem 2.1 and (5.39)–(5.41).Now we are ready to give the proofs of Propositions 5.10 and 5.11. We firstprove Proposition 5.11. To prove Proposition 5.11, we need to apply some pre-liminary lemmas, which provide the bounds for E k θ n − θ ∗ k , E k θ n − θ ( i ) n k and E k θ n − θ k . The proofs of the lemmas can be found in Appendix A. .-M. Shao and Z.-S. Zhang/Berry–Esseen bounds for multivariate nonlinear statistics Lemma 5.12.
Lemma 5.12. For $\alpha \in (1/2, 1)$, we have
\[
E\|\theta_n - \theta^*\|^2 \le C n^{-\alpha}(\tau_0^2 + \tau^2), \quad \text{for } n \ge 1. \tag{5.42}
\]
For $\alpha = 1$, we have
\[
E\|\theta_n - \theta^*\|^2 \le \begin{cases} C n^{-1}(\tau_0^2 + \tau^2), & \mu\ell_1 > 1, \\ C n^{-1}(\log n)(\tau_0^2 + \tau^2), & \mu\ell_1 = 1, \\ C n^{-\mu\ell_1}(\tau_0^2 + \tau^2), & 0 < \mu\ell_1 < 1. \end{cases} \tag{5.43}
\]

Lemma 5.13.
For $\alpha \in (1/2, 1)$, we have
\[
E\|\theta_j - \theta_j^{(i)}\|^2 \le C(\tau_0^2 + \tau^2)\, i^{-2\alpha} \exp\big\{-\mu\big(\varphi_{1-\alpha}(j) - \varphi_{1-\alpha}(i)\big)\big\} \tag{5.44}
\]
\[
\le C(\tau_0^2 + \tau^2)\, j^{-2\alpha}. \tag{5.45}
\]
For $\alpha = 1$, we have
\[
E\|\theta_j - \theta_j^{(i)}\|^2 \le C(\tau_0^2 + \tau^2)\, i^{-2} (i/j)^{2\mu\ell_1}. \tag{5.46}
\]
Here, $\varphi_{1-\alpha}$ is as given in (5.32).

Lemma 5.14.
For $\alpha \in (0, 1)$,
\[
E\|\theta_j - \theta^*\|^4 \le C j^{-2\alpha}(\tau_0^4 + \tau^4). \tag{5.47}
\]
For $\alpha = 1$,
\[
E\|\theta_j - \theta^*\|^4 \le \begin{cases} C j^{-2}(\tau_0^4 + \tau^4), & \ell_1\mu > 1, \\ C j^{-2}(\log j)(\tau_0^4 + \tau^4), & \ell_1\mu = 1. \end{cases} \tag{5.48}
\]

Proof of Proposition 5.11. Recall that we assume that $n \ge 4\{(2L\ell_1)^{1/\alpha} + 1\}$. Now we consider the following two cases.
1. If $\alpha \in (1/2, 1)$. First, by Proposition 5.10, $\Sigma_n^{-1} \le C_1 I_d$ and $\|Q_j\| \le C$ for each $1 \le j \le n-1$, provided $n \ge 4\{(2L\ell_1)^{1/\alpha} + 1\}$. For $\Delta_1$, by (C0), we have
\[
E\Delta_1^2 \le C n^{-1} E\|\theta_0 - \theta^*\|^2 \le C \tau_0^2 n^{-1}.
\]
By the Cauchy inequality and noting that $E\{WW^\top\} = I_d$, we have
\[
E\{\Delta_1\|W\|\} \le C d^{1/2} \tau_0 n^{-1/2}.
\]
Recall that by (C1), $(\eta_j)_{j \ge 1}$ is a martingale difference sequence and $\|\eta_j\| \le c_1\|\theta_{j-1} - \theta^*\|$; then by (5.36) and (5.42), if $\alpha \in (1/2, 1)$,
\[
E\Delta_2^2 \le C_1\, E\Big\|\frac{1}{\sqrt{n}}\sum_{i=1}^{n-1} Q_i \eta_i\Big\|^2
\le C n^{-1} \sum_{i=1}^{n-1} E\|\eta_i\|^2
\le C n^{-1} \sum_{i=1}^{n-1} E\|\theta_{i-1} - \theta^*\|^2
\le C n^{-1}(\tau_0^2+\tau^2)\sum_{i=1}^{n-1} i^{-\alpha}
\le C n^{-\alpha}(\tau_0^2+\tau^2).
\]
Recall that $E\{WW^\top\} = I_d$ and thus $E\|W\|^2 = d$; then, by the Cauchy inequality again,
\[
E\{\Delta_2\|W\|\} \le C d^{1/2}(\tau_0+\tau) n^{-\alpha/2}.
\]
For $\Delta_3$, by Proposition 5.10 and Lemma 5.14,
\[
E\{\Delta_3\|W\|\} \le C n^{-1/2}\sum_{i=1}^{n-1} E\{\|\theta_{i-1}-\theta^*\|^2\|W\|\}
\le C d^{1/2} n^{-1/2} \sum_{i=1}^{n-1} \big(E\|\theta_{i-1}-\theta^*\|^4\big)^{1/2}
\]
\[
\le C d^{1/2} n^{-1/2}\Big(\sum_{i=1}^{n-2} \big(E\|\theta_i-\theta^*\|^4\big)^{1/2} + \big(E\|\theta_0-\theta^*\|^4\big)^{1/2}\Big)
\le C d^{1/2} n^{-1/2}(\tau_0^2+\tau^2)\sum_{i=1}^{n} i^{-\alpha}
\le C d^{1/2} n^{(1-2\alpha)/2}(\tau_0^2+\tau^2). \tag{5.49}
\]
Now we move to bounding $\sum_i E\{|\Delta_2 - \Delta_2^{(i)}|\|\zeta_i\|\}$ and $\sum_i E\{|\Delta_3 - \Delta_3^{(i)}|\|\zeta_i\|\}$. For $\|\zeta_i\|$, by (C1) and Proposition 5.10, we have
\[
E\|\zeta_i\|^2 = n^{-1} E\|\Sigma_n^{-1/2} Q_i \xi_i\|^2 \le C n^{-1}\tau^2. \tag{5.50}
\]
For $\Delta_2 - \Delta_2^{(i)}$,
\[
|\Delta_2 - \Delta_2^{(i)}| \le n^{-1/2}\Big\|\sum_{j=1}^{n-1}\Sigma_n^{-1/2} Q_j\big(\eta_j - \eta_j^{(i)}\big)\Big\|, \tag{5.51}
\]
and
\[
\eta_j - \eta_j^{(i)} = \begin{cases} 0, & j < i; \\ g(\theta_{j-1}, \xi_j) - g(\theta_{j-1}^{(i)}, \xi_j'), & j = i; \\ g(\theta_{j-1}, \xi_j) - g(\theta_{j-1}^{(i)}, \xi_j), & j > i. \end{cases}
\]
By the construction of $\eta_j^{(i)}$ and by (3.23), for each $j$, $\eta_j \stackrel{d}{=} \eta_j^{(i)}$ and $\|\eta_j^{(i)}\| \le c_1\|\theta_{j-1}^{(i)} - \theta^*\|$; and for $j > i$, $\|\eta_j - \eta_j^{(i)}\| \le c_1\|\theta_{j-1} - \theta_{j-1}^{(i)}\|$. Set
\[
\mathcal{F}_j^{(i)} = \begin{cases} \mathcal{F}_j, & j < i; \\ \mathcal{F}_j \vee \sigma(\xi_i'), & j \ge i. \end{cases}
\]
Then $(\eta_j - \eta_j^{(i)})_{j \ge 1}$ is a martingale difference sequence with respect to $(\mathcal{F}_j^{(i)})$. Hence, by (5.36) and (5.51),
\[
E\big|\Delta_2 - \Delta_2^{(i)}\big|^2
\le C n^{-1} E\|\eta_i - \eta_i^{(i)}\|^2 + C n^{-1}\sum_{j=i+1}^{n-1} p_j^2 E\|\eta_j - \eta_j^{(i)}\|^2
\]
\[
\le C c_1^2 n^{-1} E\|\theta_{i-1} - \theta^*\|^2 + C c_1^2 n^{-1}\sum_{j=i+1}^{n-1} E\|\theta_{j-1} - \theta_{j-1}^{(i)}\|^2
\le C(\tau_0^2+\tau^2) n^{-1} i^{-\alpha} + C(\tau_0^2+\tau^2) n^{-1} \sum_{j\ge i} j^{-2\alpha}
\le C(\tau_0^2+\tau^2) n^{-1} i^{-2\alpha+1}, \tag{5.52}
\]
where we used (5.42) and (5.45) in the second-to-last step. By (5.50), (5.52) and the Cauchy inequality, we have
\[
\sum_{i=1}^{n-1} E\big\{|\Delta_2 - \Delta_2^{(i)}|\,\|\zeta_i\|\big\} \le C(\tau_0^2+\tau^2) n^{(1-2\alpha)/2}.
\]
For $\Delta_3 - \Delta_3^{(i)}$, by the Hölder inequality, and noting that $\theta_j \stackrel{d}{=} \theta_j^{(i)}$,
\[
\sum_{i=1}^{n-1}\sum_{j=1}^{n} E\Big\{\|\xi_i\|\,\big|\|\theta_{j-1}-\theta^*\|^2 - \|\theta_{j-1}^{(i)}-\theta^*\|^2\big|\Big\}
\le \sum_{i=1}^{n-1}\sum_{j=0}^{n-1} E\big\{\|\xi_i\|\,\|\theta_j-\theta^*\|\,\|\theta_j - \theta_j^{(i)}\|\big\} + \sum_{i=1}^{n-1}\sum_{j=0}^{n-1} E\big\{\|\xi_i\|\,\|\theta_j^{(i)}-\theta^*\|\,\|\theta_j - \theta_j^{(i)}\|\big\}
\]
\[
\le 2\sum_{i=1}^{n-1}\sum_{j=0}^{n-1} \big(E\|\xi_i\|^4\big)^{1/4}\big(E\|\theta_j-\theta^*\|^4\big)^{1/4}\big(E\|\theta_j - \theta_j^{(i)}\|^2\big)^{1/2}
\le C(\tau_0^2+\tau^2)\sum_{i=1}^{n-1}\sum_{j=i}^{n} j^{-\alpha/2}\, i^{-\alpha} \exp\big\{-C\big(j^{1-\alpha} - i^{1-\alpha}\big)\big\},
\]
where we used (5.44), (5.47) and (C1) in the last line; the terms with $j < i$ vanish because $\theta_j = \theta_j^{(i)}$ for $j < i$.
Observe that
\[
\sum_{j=i}^{n} j^{-\alpha/2}\exp\big\{-C\big(j^{1-\alpha} - i^{1-\alpha}\big)\big\} \le C n^{\alpha/2},
\]
and it follows that
\[
\sum_{i=1}^{n-1} E\{|\Delta_3 - \Delta_3^{(i)}|\,\|\zeta_i\|\}
\le C n^{-1}\sum_{i=1}^{n-1}\sum_{j=0}^{n-1} E\big\{\|\xi_i\|\,\big|\|\theta_j-\theta^*\|^2 - \|\theta_j^{(i)}-\theta^*\|^2\big|\big\}
\le C(\tau_0^2+\tau^2) n^{-\alpha/2}. \tag{5.53}
\]
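The displayed sum bound admits a quick numerical sanity check. In the sketch below, the value of $\alpha$ and of the constant $C$ in the exponent are arbitrary choices made for the illustration; the ratio to $n^{\alpha/2}$ should stay bounded in $n$, uniformly over $i$.

```python
import numpy as np

alpha, C = 0.75, 0.5                      # illustrative values of α and of the constant C
for n in [10**3, 10**4, 10**5]:
    j = np.arange(1, n + 1, dtype=float)
    worst = max(
        (j[i - 1:] ** (-alpha / 2)
         * np.exp(-C * (j[i - 1:] ** (1 - alpha) - i ** (1 - alpha)))).sum()
        / n ** (alpha / 2)
        for i in [1, n // 10, n // 2, n]  # a few representative starting indices i
    )
    print(n, worst)                       # ratio stays bounded as n grows
```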
2. If $\alpha = 1$. Since the bound on $E\{\Delta_1\|W\|\}$ does not depend on $\alpha$, it suffices to bound $E\{\Delta_j\|W\|\}$ and $\sum_i E\{|\Delta_j - \Delta_j^{(i)}|\|\zeta_i\|\}$ for $j = 2, 3$. By Proposition 5.10, we have $\Sigma_n^{-1/2} \le C I_d$, and for $1 \le i \le n-1$,
\[
p_i \le \begin{cases} C, & \ell_1\mu > 1, \\ C\log n, & \ell_1\mu = 1. \end{cases}
\]
For $\Delta_2$, noting that $\|\eta_i\| \le c_1\|\theta_{i-1} - \theta^*\|$, by Proposition 5.10 with $\alpha = 1$ and by (5.43), we have
\[
E\Delta_2^2 \le C n^{-1}\sum_{i=1}^{n-1} p_i^2 E\|\eta_i\|^2
\le \begin{cases} C(\tau_0^2+\tau^2) n^{-1}\sum_{i=1}^{n-1} i^{-1}, & \text{if } \ell_1\mu > 1, \\ C(\tau_0^2+\tau^2) n^{-1}\sum_{i=1}^{n-1} (\log n)^2 i^{-1}\log i, & \text{if } \ell_1\mu = 1, \end{cases}
\le \begin{cases} C(\tau_0^2+\tau^2) n^{-1}\log n, & \text{if } \ell_1\mu > 1, \\ C(\tau_0^2+\tau^2) n^{-1}(\log n)^4, & \text{if } \ell_1\mu = 1. \end{cases}
\]
Recalling that $E\{WW^\top\} = I_d$, we obtain
\[
E\{\Delta_2\|W\|\} \le \begin{cases} C d^{1/2}(\tau_0+\tau) n^{-1/2}(\log n)^{1/2}, & \ell_1\mu > 1, \\ C d^{1/2}(\tau_0+\tau) n^{-1/2}(\log n)^2, & \ell_1\mu = 1. \end{cases}
\]
Similarly to (5.49), by Proposition 5.10 and Lemma 5.14, we have
\[
E\{\Delta_3\|W\|\} \le C n^{-1/2}\sum_{i=1}^{n-1} p_i E\{\|\theta_{i-1}-\theta^*\|^2\|W\|\}
\le C d^{1/2} n^{-1/2}\sum_{i=1}^{n-1} p_i\big(E\|\theta_{i-1}-\theta^*\|^4\big)^{1/2}
\]
\[
\le \begin{cases} C d^{1/2} n^{-1/2}(\tau_0^2+\tau^2)\sum_{i=1}^{n-1} i^{-1}, & \ell_1\mu > 1, \\ C d^{1/2} n^{-1/2}(\tau_0^2+\tau^2)(\log n)\sum_{i=1}^{n-1} i^{-1}(\log i)^{1/2}, & \ell_1\mu = 1, \end{cases}
\le \begin{cases} C d^{1/2}(\tau_0^2+\tau^2) n^{-1/2}\log n, & \ell_1\mu > 1, \\ C d^{1/2}(\tau_0^2+\tau^2) n^{-1/2}(\log n)^{5/2}, & \ell_1\mu = 1. \end{cases}
\]
Similarly to (5.52), and using (5.36), we have
\[
E\big|\Delta_2 - \Delta_2^{(i)}\big|^2 \le C(\tau_0^2+\tau^2) \times \begin{cases} n^{-1} i^{-1}, & \ell_1\mu > 1, \\ n^{-1}(\log n)^2 i^{-1}\log i, & \ell_1\mu = 1. \end{cases}
\]
By (C1) and Proposition 5.10,
\[
E\|\zeta_i\|^2 \le C n^{-1} p_i^2 E\|\xi_i\|^2 \le \begin{cases} C n^{-1}\tau^2, & \ell_1\mu > 1, \\ C n^{-1}\tau^2(\log n)^2, & \ell_1\mu = 1. \end{cases}
\]
By the Cauchy inequality,
\[
\sum_{i=1}^{n-1} E\big\{|\Delta_2 - \Delta_2^{(i)}|\,\|\zeta_i\|\big\} \le C(\tau_0^2+\tau^2) n^{-3/2} \times \begin{cases} \sum_{i=1}^{n-1} i^{-1/2}, & \ell_1\mu > 1, \\ \sum_{i=1}^{n-1} i^{-1/2}(\log i)^{1/2}(\log n)^2, & \ell_1\mu = 1, \end{cases}
\le C(\tau_0^2+\tau^2) \times \begin{cases} n^{-1/2}, & \ell_1\mu > 1, \\ n^{-1/2}(\log n)^{5/2}, & \ell_1\mu = 1. \end{cases}
\]
Similarly to (5.53), and by (5.36), we have
\[
\sum_{i=1}^{n-1} E\big\{|\Delta_3 - \Delta_3^{(i)}|\,\|\zeta_i\|\big\}
\le C n^{-3/2}\sum_{i=1}^{n-1}\sum_{j=i}^{n} p_i E\big\{\|\xi_i\|\,\big|\|\theta_j-\theta^*\|^2 - \|\theta_j^{(i)}-\theta^*\|^2\big|\big\}
\le C(\tau_0^2+\tau^2) \times \begin{cases} n^{-1/2}, & \mu\ell_1 > 1, \\ n^{-1/2}(\log n)^{5/2}, & \mu\ell_1 = 1. \end{cases}
\]
This completes the proof.

Proof of Proposition 5.10. Note that $\ell_k = \ell_1 k^{-\alpha}$ is decreasing in $k$, and recall that $G = \nabla^2 f(\theta^*)$. By (3.24), we have $\mu I_d \le G \le L I_d$ and $L^{-1} I_d \le G^{-1} \le \mu^{-1} I_d$. Let $i_0 = \lfloor (2L\ell_1)^{1/\alpha} + 1 \rfloor$. For $i > i_0$, we have $\ell_1 i^{-\alpha}\mu \le \ell_1 i^{-\alpha} L \le 1/2$. Then,
\[
Q_i \ge \ell_i \sum_{j=i}^{n-1} (I - \ell_i G)^{j-i}
= \ell_i \big\{ I + (I - \ell_i G) + \cdots + (I - \ell_i G)^{n-1-i} \big\}
= G^{-1} - (I - \ell_i G)^{n-i} G^{-1}
\ge L^{-1}\big\{1 - (1 - \ell_i\mu)^{n-i}\big\} I_d, \tag{5.54}
\]
and for $i_0 + 1 \le i \le n/2$,
\[
1 - (1 - \ell_i\mu)^{n-i} \ge 1 - (1 - \ell_i\mu)^{n/2} \ge 1 - \exp\big\{-\tfrac{1}{2}\mu n \ell_i\big\} \ge 1 - \exp\big\{-\tfrac{1}{2}\ell_1\mu\big\} =: c_G, \tag{5.55}
\]
where we used the fact that $n\ell_i \ge \ell_1$ in the last inequality.

Recall that $\lambda_{\min}(\Sigma_i) \ge \lambda_0$ for each $1 \le i \le n-1$. For any $i_0 + 1 \le i \le n/2$, by (C1), (C2), (5.54) and (5.55),
\[
Q_i \Sigma_i Q_i \ge c_G^2 L^{-2}\lambda_0 I_d.
\]
By the assumption that $n \ge 4\{(2L\ell_1)^{1/\alpha} + 1\}$, it follows that $i_0 \le n/4$. Therefore,
\[
\frac{1}{n}\sum_{i=1}^{n-1} Q_i \Sigma_i Q_i
= \frac{1}{n}\sum_{i=1}^{i_0} Q_i \Sigma_i Q_i + \frac{1}{n}\sum_{i=i_0+1}^{n-1} Q_i \Sigma_i Q_i
\ge \frac{1}{n}\sum_{i=i_0+1}^{\lfloor n/2\rfloor} Q_i \Sigma_i Q_i
\ge \frac{c_G^2\lambda_0}{8L^2} I_d,
\]
because $Q_i \Sigma_i Q_i \ge 0$ for each $i \le i_0$. Therefore,
\[
\Sigma_n^{-1} = \Big(\frac{1}{n}\sum_{i=1}^{n-1} Q_i \Sigma_i Q_i\Big)^{-1} \le \frac{8L^2}{c_G^2\lambda_0} I_d.
\]
This proves the first inequality of (5.36). Now we move to prove the second inequality of (5.36). The following proof uses an idea of Polyak and Juditsky [24, pp. 845-846].
Write
\[
V_i^m = \prod_{k=i}^{m-1}\big(I - \ell_k G\big), \qquad U_i^m = (V_i^m)^\top G^{-1} V_i^m,
\]
and it follows that
\[
U_i^{m+1} = U_i^m - 2\ell_m (V_i^m)^\top V_i^m + \ell_m^2 (V_i^m)^\top G V_i^m \le U_i^m - 2\ell_m (V_i^m)^\top V_i^m + \ell_m^2 L^2 U_i^m.
\]
Recall that by (C2), $\mu I_d \le G \le L I_d$, and then
\[
L^{-1}(V_i^m)^\top V_i^m \le U_i^m \le \mu^{-1}(V_i^m)^\top V_i^m \tag{5.56}
\]
and
\[
U_i^{m+1} \le \big(1 - 2\ell_m\mu + \ell_m^2 L^2\big) U_i^m.
\]
Therefore, for $j \ge i$,
\[
U_i^j \le \exp\Big\{-2\mu\sum_{k=i}^{j-1}\ell_k\Big\}\exp\Big\{L^2\sum_{k=i}^{j-1}\ell_k^2\Big\} U_i^i \le C\mu^{-1}\exp\Big\{-2\mu\sum_{k=i}^{j-1}\ell_k\Big\} I_d. \tag{5.57}
\]
If $\alpha \in (1/2, 1)$,
\[
U_i^j \le C\exp\Big\{-2\mu\ell_1\big(\varphi_{1-\alpha}(j-1) - \varphi_{1-\alpha}(i-1)\big)\Big\} I_d
= C\exp\Big\{-\frac{2\mu\ell_1}{1-\alpha}\big((j-1)^{1-\alpha} - (i-1)^{1-\alpha}\big)\Big\} I_d.
\]
By (5.56), we have
\[
\|V_i^j\| \le L^{1/2}\|U_i^j\|^{1/2} \le C\exp\Big\{-\frac{\mu\ell_1}{1-\alpha}\big((j-1)^{1-\alpha} - (i-1)^{1-\alpha}\big)\Big\}. \tag{5.58}
\]
For $\alpha \in (0, 1)$, by a simple calculation,
\[
\sum_{j=i}^{n}\exp\Big\{-\frac{\mu\ell_1}{1-\alpha}\big(j^{1-\alpha} - i^{1-\alpha}\big)\Big\} \le C i^\alpha,
\]
and we have
\[
\|Q_i\| \le \ell_i \sum_{j=i}^{n-1}\|V_{i+1}^{j+1}\| \le C\ell_i\sum_{j=i}^{n-1}\exp\Big\{-\frac{\mu\ell_1}{1-\alpha}\big(j^{1-\alpha} - i^{1-\alpha}\big)\Big\} \le C\ell_1 i^{-\alpha}\cdot i^\alpha \le C. \tag{5.59}
\]
Similarly, $Q_i \ge -C I_d$. This proves the second inequality of (5.36) for $\alpha \in (1/2, 1)$.

For $\alpha = 1$, by (5.57), and noting that $\sum_{k=1}^\infty \ell_k^2 \le 2\ell_1^2$, we have
\[
U_i^j \le C\exp\big\{-2\mu\ell_1\big(\log(j-1) - \log(i-1)\big)\big\} I_d = C\Big(\frac{i-1}{j-1}\Big)^{2\mu\ell_1} I_d.
\]
Similarly to (5.58) and (5.59), if $1 \le i \le n-1$, we have
\[
\|Q_i\| \le C(i+1)^{\mu\ell_1 - 1}\big\{\varphi_{1-\mu\ell_1}(n-1) - \varphi_{1-\mu\ell_1}(i+1)\big\} \le \begin{cases} C, & \ell_1\mu > 1, \\ C\log n, & \ell_1\mu = 1. \end{cases}
\]
If $i = 0$, the result (5.36) follows from the observation that $Q_0 = \ell_1 I_d + (I - \ell_1 G) Q_1$. This completes the proof of the upper bound. The lower bound can be shown similarly.
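The matrices $Q_i = \ell_i \sum_{j=i}^{n-1}\prod_{k=i+1}^{j}(I - \ell_k G)$ can also be computed directly; the sketch below does so and tracks $p_i = \|Q_i\|$ numerically. The Hessian $G$ and all parameters are assumptions of the example; the output illustrates that $\max_i p_i$ stays bounded for $\alpha \in (1/2, 1)$, as (5.36) asserts.

```python
import numpy as np

def q_norms(n, alpha, ell1, G):
    """Spectral norms of Q_i = ℓ_i Σ_{j=i}^{n-1} Π_{k=i+1}^{j} (I - ℓ_k G), i = 1, ..., n-1."""
    d = G.shape[0]
    ell = lambda k: ell1 * k ** (-alpha)
    norms = []
    for i in range(1, n):
        Q, prod = np.zeros((d, d)), np.eye(d)   # prod holds Π_{k=i+1}^{j} (I - ℓ_k G)
        for j in range(i, n):
            Q += prod
            prod = (np.eye(d) - ell(j + 1) * G) @ prod
        norms.append(ell(i) * np.linalg.norm(Q, 2))
    return np.array(norms)

G = np.diag([1.0, 2.0])                         # assumed Hessian with µ = 1, L = 2
for n in [100, 200, 400]:
    print(n, q_norms(n, 0.75, 0.5, G).max())    # max_i p_i stays bounded as n grows
```

Rerunning with `alpha = 1.0` and `ell1 = 1.0` (so that $\ell_1\mu = 1$) exhibits the logarithmic growth of $\max_i p_i$ in the boundary case.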
Appendix A: Proofs of some lemmas in Section 5
In this appendix, we give the proofs of some of the lemmas in Section 5.
A.1. Preliminary lemmas
To begin with, we introduce some preliminary lemmas. The first lemma provides a moment inequality for sums of independent random vectors.
Lemma A.1.
Let $\zeta_1, \ldots, \zeta_n \in \mathbb{R}^d$ be mean-zero independent random vectors and $S_n = \sum_{i=1}^n \zeta_i / \sqrt{n}$. Assume that $\max_{1\le i\le n}\|\zeta_i\|_p \le a$ for some $p > 2$ and $a > 0$. Then,
\[
\|S_n\|_p \le C a, \tag{A.1}
\]
where $C > 0$ is a constant depending only on $p$. Let $\|\cdot\|_{\psi_1}$ be the Orlicz norm defined as in (3.19) with $\psi_1(x) = e^x - 1$. Assume that $\max_{1\le i\le n}\|\zeta_i\|_{\psi_1} \le a$ for some constant $a > 0$. Then,
\[
\|S_n\|_{\psi_1} \le C a, \tag{A.2}
\]
where $C > 0$ is an absolute constant.

Proof.
Noting that $p > 2$, by the Hölder inequality, $\|\zeta_i\|_2 \le \|\zeta_i\|_p \le a$, and then
\[
\|S_n\|_1 \le \|S_n\|_2 = \Big(n^{-1}\sum_{i=1}^n \|\zeta_i\|_2^2\Big)^{1/2} \le a.
\]
By the Hoffmann-Jørgensen inequality (see Talagrand [30, Theorem 1]), there exists a constant $C > 0$ depending only on $p$ such that
\[
\|S_n\|_p \le C\Big(\|S_n\|_1 + n^{-1/2}\Big\|\max_{1\le i\le n}\|\zeta_i\|\Big\|_p\Big). \tag{A.3}
\]
By integration by parts,
\[
\Big\|\max_{1\le i\le n}\|\zeta_i\|\Big\|_p^p = \int_0^\infty P\Big(\max_{1\le i\le n}\|\zeta_i\|^p > t\Big)\,dt
\le \sum_{i=1}^n \int_0^\infty P\big(\|\zeta_i\|^p > t\big)\,dt = \sum_{i=1}^n \|\zeta_i\|_p^p \le n\max_{1\le i\le n}\|\zeta_i\|_p^p. \tag{A.4}
\]
Substituting (A.4) into (A.3), there exist two constants $C_1$ and $C_2$ depending only on $p$ such that
\[
\|S_n\|_p \le C_1\big(a + n^{-1/2+1/p}\max_{1\le i\le n}\|\zeta_i\|_p\big) \le C_2 a.
\]
This proves (A.1). Note that $\|\zeta_i\|_2 \le C\|\zeta_i\|_{\psi_1}$; then, by a similar argument and the Hoffmann-Jørgensen inequality for the $\|\cdot\|_{\psi_1}$ norm (see Talagrand [30, Theorem 3]), the inequality (A.2) holds.

We next introduce some notation from empirical process theory, following Van der Vaart and Wellner [32]. For any function class $\mathcal{F}$, write
\[
\|\mathbb{G}_n\|_{\mathcal{F}} = \sup_{f\in\mathcal{F}} |\mathbb{G}_n f|, \qquad \|\mathbb{P}_n - P\|_{\mathcal{F}} = \sup_{f\in\mathcal{F}} |\mathbb{P}_n f - P f|. \tag{A.5}
\]
By (3.1), it is easy to see that $\|\mathbb{G}_n\|_{\mathcal{F}} = \sqrt{n}\,\|\mathbb{P}_n - P\|_{\mathcal{F}}$. Let
\[
\mathcal{M}_\delta := \big\{ m_\theta - m_{\theta^*} : \|\theta - \theta^*\| \le \delta \big\}.
\]
Then,
\[
\|\mathbb{P}_n - P\|_{\mathcal{M}_\delta} = \sup_{\theta:\|\theta-\theta^*\|\le\delta}\big|(M_n - M)(\theta) - (M_n - M)(\theta^*)\big|, \qquad
\|\mathbb{G}_n\|_{\mathcal{M}_\delta} = \sqrt{n}\,\|\mathbb{P}_n - P\|_{\mathcal{M}_\delta}.
\]
Note that $\|\mathbb{G}_n\|_{\mathcal{M}_\delta}$ may not be measurable, so we work with its outer expectation; see Van der Vaart and Wellner [32, Section 1.2] for a thorough reference. Let $E^*$ be the outer expectation operator and, for any map $Y$, let
\[
\|Y\|_p^* = \big(E^*\{|Y|^p\}\big)^{1/p}, \qquad \|Y\|_{\psi_1}^* = \inf\Big\{C > 0 : E^*\Big\{\psi_1\Big(\frac{|Y|}{C}\Big)\Big\} \le 1\Big\}.
\]
Let $P^*$ be the outer probability operator. The next lemma provides bounds on $\big\|\|\mathbb{G}_n\|_{\mathcal{M}_\delta}\big\|_q^*$ and $\big\|\|\mathbb{G}_n\|_{\mathcal{M}_\delta}\big\|_{\psi_1}^*$; its proof is based on empirical process theory.
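As a toy numerical illustration of the notation in (A.5), the sketch below evaluates the empirical process over a small class $\mathcal{M}_\delta$. The criterion $m_\theta(x) = (x - \theta)^2$, the distribution $P = N(0,1)$ and the grid over $\theta$ are assumptions of the example (here $d = 1$, so the sup is over an interval).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=n)                  # sample from P = N(0, 1)

# M_δ = {m_θ - m_θ*: |θ - θ*| ≤ δ}, with m_θ(x) = (x - θ)², θ* = 0, δ = 0.5
delta = 0.5
thetas = np.linspace(-delta, delta, 201)

def G_n(theta):
    f = (X - theta) ** 2 - X ** 2       # the function m_θ - m_θ* evaluated on the sample
    Pf = theta ** 2                     # E(X-θ)² - EX² = θ² under N(0, 1)
    return np.sqrt(n) * (f.mean() - Pf) # G_n f = √n(P_n f - P f)

sup_Gn = max(abs(G_n(t)) for t in thetas)
print(sup_Gn)                           # ‖G_n‖_{M_δ}; of order δ, as Lemma A.2 predicts for d = 1
```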
Lemma A.2. For $q \ge 2$, assume that (3.7) is satisfied with $\|\bar m(X_1)\|_q \le a$ for a positive constant $a$. Then, we have
\[
\big\|\|\mathbb{G}_n\|_{\mathcal{M}_\delta}\big\|_q^* \le C a\sqrt{d}\,\delta, \tag{A.6}
\]
where $C > 0$ is a constant depending only on $q$. Assume further that there exists a constant $a > 0$ such that $\|\bar m(X_1)\|_{\psi_1} \le a$; then we have
\[
\big\|\|\mathbb{G}_n\|_{\mathcal{M}_\delta}\big\|_{\psi_1}^* \le C a\sqrt{d}\,\delta, \tag{A.7}
\]
where $C > 0$ is an absolute constant.

Proof. Recall that by (3.7), $\|\bar m(X_1)\|_2 \le \|\bar m(X_1)\|_q \le a$, and the class $\mathcal{M}_\delta$ has an envelope $F = \delta\bar m$ such that $\|F(X_1)\|_q \le a\delta$. It has been shown (see, e.g., Van der Vaart [31, Chapters 5 and 19] and Wellner [34, Corollary 3.1]) that
\[
E^*\|\mathbb{G}_n\|_{\mathcal{M}_\delta} \le C a d^{1/2}\delta,
\]
where $C > 0$ is an absolute constant. By the Hoffmann-Jørgensen inequality (see Van der Vaart and Wellner [32, Theorem 2.14.5]), we have
\[
E^*\|\mathbb{G}_n\|_{\mathcal{M}_\delta}^q \le C\Big(\big(E^*\|\mathbb{G}_n\|_{\mathcal{M}_\delta}\big)^q + E|F(X_1)|^q\Big) \le C a^q d^{q/2}\delta^q,
\]
where $C > 0$ is a constant depending only on $q$. This proves (A.6). Inequality (A.7) follows from a similar argument.

Lemma A.3.
Let $\mathbb{G}_n$ be as in (A.5) and let
\[
\mathcal{F}_{\delta,j} = \big\{ h_{\theta,j} - h_{\theta^*,j} : \|\theta - \theta^*\| \le \delta \big\} \quad \text{for } 1 \le j \le d.
\]
(i) Assume that there exists a constant $a > 0$ such that (3.14) is satisfied with $\|\bar h(X_1)\|_p \le a$; then
\[
\big\|\|\mathbb{G}_n\|_{\mathcal{F}_{\delta,j}}\big\|_p^* \le C a d^{1/2}\delta, \tag{A.8}
\]
and
\[
\Big\|\sup_{\|\theta-\theta^*\|\le\delta}\|\Psi_n(\theta) - \Psi(\theta) - \Psi_n(\theta^*) + \Psi(\theta^*)\|\Big\|_p^* \le C n^{-1/2} a\, d\,\delta, \tag{A.9}
\]
where $C > 0$ is a constant depending only on $p$.
(ii) Assume that (3.14) is satisfied with $\|\bar h(X_1)\|_{\psi_1} \le a$; then (A.8) and (A.9) hold for all $p > 2$, and
\[
\Big\|\sup_{\|\theta-\theta^*\|\le\delta}\|\Psi_n(\theta) - \Psi(\theta) - \Psi_n(\theta^*) + \Psi(\theta^*)\|\Big\|_{\psi_1}^* \le C n^{-1/2} a\, d^{3/2}\delta, \tag{A.10}
\]
where $C > 0$ is an absolute constant.

Proof.
For (i), note that (A.8) follows directly from Lemma A.2, (3.14) and (3.15). For (A.9), by (A.8) and the definitions of $h_\theta$ and $h_{\theta,j}$, $j = 1, 2, \ldots, d$,
\[
E^*\Big\{\sup_{\|\theta-\theta^*\|\le\delta}\|\Psi_n(\theta) - \Psi(\theta) - \Psi_n(\theta^*) + \Psi(\theta^*)\|^p\Big\}
\le d^{p/2-1} n^{-p/2}\sum_{j=1}^d E^*\|\mathbb{G}_n\|_{\mathcal{F}_{\delta,j}}^p \le C n^{-p/2} a^p d^p \delta^p,
\]
which proves (A.9). Observe that for any random element $Y$ and $p \ge 1$, $\|Y\|_p \le C_p\|Y\|_{\psi_1}$. Hence, under the conditions (B1), (B4) and (B5), inequalities (A.8) and (A.9) follow by a similar argument. For (A.10), by Van der Vaart and Wellner [32, Theorem 2.14.5], it follows from (3.14), (3.20) and (A.8) that
\[
\big\|\|\mathbb{G}_n\|_{\mathcal{F}_{\delta,j}}\big\|_{\psi_1}^* \le C\Big(\big\|\|\mathbb{G}_n\|_{\mathcal{F}_{\delta,j}}\big\|_1^* + \delta\|\bar h(X_1)\|_{\psi_1}\Big) \le C a d^{1/2}\delta.
\]
By the triangle inequality,
\[
\Big\|\sup_{\|\theta-\theta^*\|\le\delta}\|\Psi_n(\theta) - \Psi(\theta) - \Psi_n(\theta^*) + \Psi(\theta^*)\|\Big\|_{\psi_1}^*
\le n^{-1/2}\sum_{j=1}^d\big\|\|\mathbb{G}_n\|_{\mathcal{F}_{\delta,j}}\big\|_{\psi_1}^* \le C n^{-1/2} a d^{3/2}\delta,
\]
and hence (A.10) holds.

A.2. Proof of Lemma 5.2
For each $n$, let $A_{j,n} = \{\theta : 2^{j-1} < \sqrt{n}\|\theta - \theta^*\| \le 2^j\}$, $j \ge 1$. Recall that $\hat\theta_n$ minimizes $M_n(\theta)$; thus,
\[
E^*\big\{\sqrt{n}\|\hat\theta_n - \theta^*\|\big\}^p
\le 1 + \sum_{j\ge 1} 2^{jp}\, P^*\big(\hat\theta_n \in A_{j,n}\big)
\le 1 + \sum_{j\ge 1} 2^{jp}\, P^*\Big(\inf_{\theta\in A_{j,n}}\big(M_n(\theta) - M_n(\theta^*)\big) \le 0\Big). \tag{A.11}
\]
By (3.6), we have
\[
\inf_{\theta\in A_{j,n}}\big(M(\theta) - M(\theta^*)\big) \ge \mu\inf_{\theta\in A_{j,n}}\|\theta - \theta^*\|^2 \ge \mu n^{-1} 2^{2j-2}.
\]
Set $\delta_j = 2^j n^{-1/2}$. By Lemma A.2 and the Chebyshev inequality, we have
\[
\text{RHS of (A.11)} \le 1 + \sum_{j\ge 1} 2^{jp}\, P^*\Big(\|\mathbb{G}_n\|_{\mathcal{M}_{\delta_j}} \ge \frac{\mu 2^{2j-2}}{\sqrt{n}}\Big)
\le 1 + \Big(\frac{4}{\mu}\Big)^{p+1} n^{\frac{p+1}{2}} \sum_{j\ge 1} 2^{jp - 2j(p+1)}\, E^*\big\{\|\mathbb{G}_n\|_{\mathcal{M}_{\delta_j}}^{p+1}\big\}
\]
\[
\le 1 + C\Big(\frac{a d^{1/2}}{\mu}\Big)^{p+1} n^{\frac{p+1}{2}} \sum_{j\ge 1} 2^{jp - 2j(p+1)}\,\delta_j^{p+1}
\le C\Big(1 + \Big(\frac{a}{\mu}\Big)^{p+1}\Big) d^{\frac{p+1}{2}},
\]
since $\delta_j^{p+1} = 2^{j(p+1)} n^{-(p+1)/2}$ and $\sum_{j\ge 1} 2^{-j} = 1$; here $C > 0$ depends only on $p$. This completes the proof.

A.3. Proof of Lemma 5.3
Write $Y_i = \ddot m_{\theta^*}(X_i) - E\{\ddot m_{\theta^*}(X_i)\}$. By the Rosenthal inequality for random matrices (see, e.g., Chen et al. [12, Theorem A.1]), and noting that the $Y_i$'s are symmetric $(d\times d)$-matrices satisfying (3.9),
\[
E\Big\|\frac{1}{n}\sum_{i=1}^n Y_i\Big\|^2
\le C n^{-2}\Big\|\Big(\sum_{i=1}^n E Y_i^2\Big)^{1/2}\Big\|^2 + C n^{-2} E\big\{\max_{1\le i\le n}\|Y_i\|^2\big\}
\]
\[
\le C n^{-1}\big\|\big(E\{\ddot m_{\theta^*}(X_1)^2\}\big)^{1/2}\big\|^2 + C n^{-1} E\|\ddot m_{\theta^*}(X_1)\|^2
\le C n^{-1} E\{\bar m^2(X_1)\}\,\|I_d\| + C n^{-1} E\{\bar m^2(X_1)\}\,\|I_d\|
\le C c^2 n^{-1},
\]
where $C > 0$ is an absolute constant and we used the fact that $\|I_d\| = 1$ in the last inequality. This proves (5.6). For $H_2$, by (3.8), we have
\[
E\big\{H_2^2\big\} \le \max_{1\le i\le n} E\{\bar m^2(X_i)\} \le c^2.
\]
This completes the proof of (5.7) and hence the lemma.
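The Rosenthal-type matrix bound (5.6) can be illustrated by a quick Monte Carlo experiment. The random-matrix model $Y_i = v_i v_i^\top - I_d$ below is an assumption of the example, not the paper's Hessian; the point is only the $n^{-1}$ decay of $E\|n^{-1}\sum_i Y_i\|^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, reps = 5, 200

def mean_dev_norm(n):
    # Y_i = v_i v_iᵀ - I_d: centered independent symmetric random matrices
    V = rng.normal(size=(n, d))
    Y = np.einsum('ij,ik->ijk', V, V) - np.eye(d)
    return np.linalg.norm(Y.mean(axis=0), 2)    # spectral norm of n⁻¹ Σ Y_i

for n in [200, 800, 3200]:
    m2 = np.mean([mean_dev_norm(n) ** 2 for _ in range(reps)])
    print(n, m2, m2 * n)   # m2·n roughly constant: E‖n⁻¹ΣY_i‖² = O(n⁻¹), as in (5.6)
```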
A.4. Proof of Lemma 5.4
By (5.1) and (5.2) and the construction of $\hat\theta_n^{(i)}$, we have
\[
\|V(\hat\theta_n - \hat\theta_n^{(i)})\| \le \frac{1}{n}\|\xi_i - \xi_i'\|
+ \big(H_1\|\hat\theta_n - \theta^*\|^2 + H_1^{(i)}\|\hat\theta_n^{(i)} - \theta^*\|^2\big)
+ \big(H_2\|\hat\theta_n - \theta^*\|^2 + H_2^{(i)}\|\hat\theta_n^{(i)} - \theta^*\|^2\big),
\]
where $\xi_i' = \dot m_{\theta^*}(X_i')$ is an independent copy of $\xi_i$.

Note that $(\xi_i, \hat\theta_n, H_1, H_2)$ has the same distribution as $(\xi_i', \hat\theta_n^{(i)}, H_1^{(i)}, H_2^{(i)})$. By Lemma 5.2 with $p = 8$, Lemma 5.3 and the Hölder inequality, we have
\[
E\|V(\hat\theta_n - \hat\theta_n^{(i)})\|^2
\le C\Big(n^{-2} E\|\xi_i\|^2 + E\big\{H_1^2\|\hat\theta_n - \theta^*\|^4\big\} + E\big\{H_2^2\|\hat\theta_n - \theta^*\|^4\big\}\Big)
\]
\[
\le C\Big(n^{-2}\|\xi_i\|_2^2 + \|H_1\|_4^2\,\|\hat\theta_n - \theta^*\|_8^4 + \|H_2\|_4^2\,\|\hat\theta_n - \theta^*\|_8^4\Big)
\le C n^{-2}\Big(c_1^2 d + \mu^{-9/2} c_1^{9/2} c_2 d^{9/4} + \mu^{-9/2} c_1^{9/2} c_3 d^{9/4}\Big),
\]
where $C > 0$ is an absolute constant. The result (5.8) immediately follows from the condition that $\lambda_{\min}(V) \ge \lambda_0$.

A.5. Proof of Lemma 5.6
The inequality (5.17) follows from Lemma A.1 and (3.17). Note that by (A.9), we have
\[
E\Delta^2 \le \lambda_0^{-2}\, n\, E^*\Big\{\sup_{\theta:\|\theta-\theta^*\|\le\delta_n}\|\Psi_n(\theta) - \Psi(\theta) - \Psi_n(\theta^*) + \Psi(\theta^*)\|^2\Big\} \le C\lambda_0^{-2} c_2^2 d^2 \delta_n^2.
\]
This proves (5.18). By (3.12), (3.11) and the Cauchy inequality, we have
\[
\mu\|\hat\theta_n - \theta^*\|^2 \le \big\langle \hat\theta_n - \theta^*,\, \Psi(\hat\theta_n) - \Psi(\theta^*)\big\rangle
= -\big\langle \hat\theta_n - \theta^*,\, \Psi_n(\theta^*) - \Psi(\theta^*)\big\rangle
- \big\langle \hat\theta_n - \theta^*,\, \big(\Psi_n(\hat\theta_n) - \Psi(\hat\theta_n)\big) - \big(\Psi_n(\theta^*) - \Psi(\theta^*)\big)\big\rangle
\]
\[
\le \|\hat\theta_n - \theta^*\|\,\|\Psi_n(\theta^*) - \Psi(\theta^*)\|
+ \|\hat\theta_n - \theta^*\|\,\big\|\big(\Psi_n(\hat\theta_n) - \Psi(\hat\theta_n)\big) - \big(\Psi_n(\theta^*) - \Psi(\theta^*)\big)\big\|,
\]
which implies
\[
\|\hat\theta_n - \theta^*\| \le \frac{1}{\mu}\|\Psi_n(\theta^*) - \Psi(\theta^*)\| + \frac{1}{\mu}\big\|\big(\Psi_n(\hat\theta_n) - \Psi(\hat\theta_n)\big) - \big(\Psi_n(\theta^*) - \Psi(\theta^*)\big)\big\|. \tag{A.12}
\]
By (3.17) and applying Lemma A.1 to $\Psi_n(\theta^*) - \Psi(\theta^*)$, we have
\[
\|\Psi_n(\theta^*) - \Psi(\theta^*)\|_p \le C c_1 d^{1/2} n^{-1/2}. \tag{A.13}
\]
Taking $L_p$-norms on both sides of (A.12), by (A.13) and Lemma A.3, we obtain
\[
\|\hat\theta_n - \theta^*\|_p \le \frac{1}{\mu}\|\Psi_n(\theta^*) - \Psi(\theta^*)\|_p
+ \frac{1}{\mu}\Big\|\sup_{\theta\in\Theta}\big\|\big(\Psi_n(\theta) - \Psi(\theta)\big) - \big(\Psi_n(\theta^*) - \Psi(\theta^*)\big)\big\|\Big\|_p^*
\le C\mu^{-1}\big(c_1 d^{1/2} n^{-1/2} + c_2 d n^{-1/2} D_\Theta\big) \le C(D_\Theta + 1) d n^{-1/2}.
\]
This proves (5.19). Now we move to prove (5.20). By (3.11), we have
\[
\Psi(\hat\theta_n) - \Psi(\theta^*) = -\big(\Psi_n(\theta^*) - \Psi(\theta^*)\big) - \Big\{\big(\Psi_n(\hat\theta_n) - \Psi(\hat\theta_n)\big) - \big(\Psi_n(\theta^*) - \Psi(\theta^*)\big)\Big\},
\]
and
\[
\Psi(\hat\theta_n^{(i)}) - \Psi(\theta^*) = -\big(\Psi_n^{(i)}(\theta^*) - \Psi(\theta^*)\big) - \Big\{\big(\Psi_n^{(i)}(\hat\theta_n^{(i)}) - \Psi(\hat\theta_n^{(i)})\big) - \big(\Psi_n^{(i)}(\theta^*) - \Psi(\theta^*)\big)\Big\}.
\]
On the event that $\|\hat\theta_n - \theta^*\| \le \delta$ and $\|\hat\theta_n^{(i)} - \theta^*\| \le \delta$, taking the difference of the foregoing identities, and by (3.12) again, we have
\[
\mu\|\hat\theta_n - \hat\theta_n^{(i)}\|\,\mathbb{1}\big(\|\hat\theta_n - \theta^*\| \le \delta,\, \|\hat\theta_n^{(i)} - \theta^*\| \le \delta\big)
\le \big\|\Psi(\hat\theta_n) - \Psi(\hat\theta_n^{(i)})\big\|\,\mathbb{1}\big(\|\hat\theta_n - \theta^*\| \le \delta,\, \|\hat\theta_n^{(i)} - \theta^*\| \le \delta\big)
\]
\[
\le \frac{1}{n}\|\xi_i - \xi_i'\| + 2\sup_{\theta:\|\theta-\theta^*\|\le\delta}\big\|\big(\Psi_n(\theta) - \Psi(\theta)\big) - \big(\Psi_n(\theta^*) - \Psi(\theta^*)\big)\big\|.
\]
By (3.17) and Lemma A.3,
\[
\mu^p\, E\Big\{\|\hat\theta_n - \hat\theta_n^{(i)}\|^p\,\mathbb{1}\big(\|\hat\theta_n - \theta^*\| \le \delta,\, \|\hat\theta_n^{(i)} - \theta^*\| \le \delta\big)\Big\} \le C d^{p/2} n^{-p} + C d^p n^{-p/2}\delta^p,
\]
and then we complete the proof of (5.20).
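The leave-one-out stability that (5.20) quantifies is easy to observe numerically. In the toy sketch below, the score $\psi(\theta; x) = \tanh(\theta - x)$, the Gaussian data and the Newton solver are all assumptions of the example; replacing a single observation moves the Z-estimator by $O(1/n)$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
X = rng.normal(size=n)

def z_estimate(sample):
    """Newton iterations for the Z-estimator solving Ψ_n(θ) = mean(tanh(θ - X_i)) = 0."""
    theta = 0.0
    for _ in range(50):
        psi = np.tanh(theta - sample)
        theta -= psi.mean() / np.maximum((1 - psi ** 2).mean(), 1e-8)
    return theta

theta_hat = z_estimate(X)
X_loo = X.copy()
X_loo[0] = rng.normal()                          # replace X_1 by an independent copy X'_1
print(abs(theta_hat - z_estimate(X_loo)) * n)    # O(1) after scaling by n, as in (5.20)
```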
A.6. Proof of Lemma 5.7

For (5.21), note that $\xi_i$ and $\hat\theta_n^{(i)}$ are independent, and $\hat\theta_n$ has the same distribution as $\hat\theta_n^{(i)}$. Hence,
\[
E\big\{\|\xi_i\|\,\|\hat\theta_n - \theta^*\|\,\mathbb{1}\big(\hat\theta_n \in B_\delta,\, \hat\theta_n^{(i)} \in B_\delta^c\big)\big\}
\le \delta\, E\big\{\|\xi_i\|\,\mathbb{1}\big(\hat\theta_n^{(i)} \in B_\delta^c\big)\big\}
= \delta\, E\|\xi_i\|\, P\big(\|\hat\theta_n^{(i)} - \theta^*\| > \delta\big)
\]
\[
\le C d^{1/2}\delta\, P\big(\|\hat\theta_n - \theta^*\| > \delta\big)
\le C d^{1/2}\delta^{-p+2}\, E\big\{\|\hat\theta_n - \theta^*\|^{p-1}\big\}
\le C(D_\Theta + 1)^p d^{p+1/2}\delta^{-p+2} n^{-(p-1)/2},
\]
where we used (5.19) in the last inequality. For (5.22), by the Hölder inequality, we have
\[
E\big\{\|\xi_i\|\,\|\hat\theta_n^{(i)} - \theta^*\|\,\mathbb{1}\big(\hat\theta_n \in B_\delta^c,\, \hat\theta_n^{(i)} \in B_\delta\big)\big\}
\le \Big(E\big\{\|\xi_i\|^{p/2}\|\hat\theta_n^{(i)} - \theta^*\|^{p/2}\big\}\Big)^{2/p}\Big(P\big(\|\hat\theta_n - \theta^*\| > \delta\big)\Big)^{(p-2)/p}
\]
\[
\le \delta^{-p+2}\Big(E\big\{\|\xi_i\|^p\big\}\, E\big\{\|\hat\theta_n^{(i)} - \theta^*\|^p\big\}\Big)^{1/p}\Big(E\big\{\|\hat\theta_n - \theta^*\|^p\big\}\Big)^{(p-2)/p}
\le C(D_\Theta + 1)^p d^{p+1/2}\delta^{-p+2} n^{-(p-1)/2}.
\]
As for (5.23), recalling that $p > 2$, and by (3.17), (5.19) and (5.20) and the Hölder inequality, we have
\[
E\Big\{\|\xi_i\|\big(\|\hat\theta_n - \theta^*\| + \|\hat\theta_n^{(i)} - \theta^*\|\big)\|\hat\theta_n - \hat\theta_n^{(i)}\|\,\mathbb{1}\big(\hat\theta_n \in B_\delta,\, \hat\theta_n^{(i)} \in B_\delta\big)\Big\}
\]
\[
\le \|\xi_i\|_p\,\|\hat\theta_n - \theta^*\|_p\,\Big\|\|\hat\theta_n - \hat\theta_n^{(i)}\|\,\mathbb{1}\big(\hat\theta_n \in B_\delta,\, \hat\theta_n^{(i)} \in B_\delta\big)\Big\|_{p/(p-2)}
+ \|\xi_i\|_p\,\|\hat\theta_n^{(i)} - \theta^*\|_p\,\Big\|\|\hat\theta_n - \hat\theta_n^{(i)}\|\,\mathbb{1}\big(\hat\theta_n \in B_\delta,\, \hat\theta_n^{(i)} \in B_\delta\big)\Big\|_{p/(p-2)}
\]
\[
= 2\|\xi_i\|_p\,\|\hat\theta_n - \theta^*\|_p\,\Big\|\|\hat\theta_n - \hat\theta_n^{(i)}\|\,\mathbb{1}\big(\hat\theta_n \in B_\delta,\, \hat\theta_n^{(i)} \in B_\delta\big)\Big\|_{p/(p-2)}
\le C(D_\Theta + 1)^2\big(d^2 n^{-3/2} + d^{5/2} n^{-1}\delta\big).
\]
This completes the proof.
A.7. Proof of Lemma 5.8
In this proof, we denote by $C$ a positive constant that depends only on $c_0, c_1, c_2, c_3, \mu, \lambda_0$ and $\lambda_1$, and by $C_p$ a constant that also depends on $p$; these may take different values in different places. By (ii) of Lemma A.3 and following the proof of Lemma 5.6, the bounds (5.19) and (5.20) also hold with constants $C_p$. This proves the first assertion of the lemma. Note that $\Psi_n(\theta^*) - \Psi(\theta^*) = n^{-1}\sum_{i=1}^n \xi_i$, and by Lemma A.1 and (3.21),
\[
\|\Psi_n(\theta^*) - \Psi(\theta^*)\|_{\psi_1} \le C c_1 d^{1/2} n^{-1/2}, \tag{A.14}
\]
and for any $p \ge 2$,
\[
\|\Psi_n(\theta^*) - \Psi(\theta^*)\|_p \le C_p c_1 d^{1/2} n^{-1/2}.
\]
For (5.29), by the fact that $P(\|Y\| > t) \le 2 e^{-t/\|Y\|_{\psi_1}}$, it follows from (A.14) and Lemma A.3 that
\[
P\big(\|\Psi_n(\theta^*) - \Psi(\theta^*)\| > \mu t/2\big) \le 2\exp\Big(-\frac{C''\sqrt{n}\mu t}{c_1 d^{1/2}}\Big),
\]
and
\[
P^*\Big(\sup_{\theta\in\Theta}\|\Psi_n(\theta) - \Psi(\theta) - \Psi_n(\theta^*) + \Psi(\theta^*)\| > \mu t/2\Big) \le 2\exp\Big(-\frac{C''\sqrt{n} t}{c_2 d^{1/2}(D_\Theta + 1)}\Big),
\]
where $C'' > 0$ is an absolute constant. By (A.12), for any $t > 0$,
\[
P\big(\|\hat\theta_n - \theta^*\| > t\big) \le P\big(\|\Psi_n(\theta^*) - \Psi(\theta^*)\| > \mu t/2\big)
+ P^*\Big(\sup_{\theta\in\Theta}\|\Psi_n(\theta) - \Psi(\theta) - \Psi_n(\theta^*) + \Psi(\theta^*)\| > \mu t/2\Big)
\le 4\exp\Big(-\frac{C''\sqrt{n}\mu t}{c_2 d^{1/2}(D_\Theta + 1) + c_1 d^{1/2}}\Big).
\]
Taking $C' = C''\mu/(c_1 + c_2)$, we complete the proof of (5.29). By Lemma 5.9 and the Hölder inequality, we have
\[
E\Big\{\|\xi_i\|\,\|\hat\theta_n - \theta^*\|\,\mathbb{1}\big(\hat\theta_n \in B_\delta,\, \hat\theta_n^{(i)} \in B_\delta^c\big)\Big\}
\le \big(E\|\xi_i\|^2\big)^{1/2}\Big(E\|\hat\theta_n - \theta^*\|^4\Big)^{1/4}\Big(P\big(\hat\theta_n^{(i)} \in B_\delta^c\big)\Big)^{1/4}
\]
\[
\le C(D_\Theta + 1) d^{3/2} n^{-1/2}\exp\Big(-\frac{C'\sqrt{n}\delta}{4(D_\Theta + 1) d^{1/2}}\Big),
\]
where we used (3.21), (5.19) and (5.29) in the last line. This proves (5.30). The inequality (5.31) can be derived using a similar argument.
A.8. Proof of Lemma 5.12

We use a recursive inequality to prove the bound. By (3.22), we have
\[
\|\theta_n - \theta^*\|^2 = \|\theta_{n-1} - \theta^*\|^2 - 2\ell_n\big\langle \theta_{n-1} - \theta^*,\, \nabla f(\theta_{n-1}) + \zeta_n\big\rangle + \ell_n^2\|\nabla f(\theta_{n-1}) + \zeta_n\|^2. \tag{A.15}
\]
Recall that by (C1), $\zeta_n = \xi_n + \eta_n$, where $\xi_n$ is independent of $\theta_{n-1}$, $\eta_n$ is $\mathcal{F}_n$-measurable, $\|\eta_n\| \le c_1\|\theta_{n-1} - \theta^*\|$, $E\{\eta_n \mid \mathcal{F}_{n-1}\} = 0$, and $\theta_{n-1}$ is $\mathcal{F}_{n-1}$-measurable. Therefore,
\[
E\big\{\langle \nabla f(\theta_{n-1}), \zeta_n\rangle \mid \mathcal{F}_{n-1}\big\} = 0.
\]
Moreover, with $L_1 := c_1 + L$,
\[
E\big\{\|\nabla f(\theta_{n-1}) + \zeta_n\|^2 \mid \mathcal{F}_{n-1}\big\}
= \|\nabla f(\theta_{n-1}) - \nabla f(\theta^*)\|^2 + E\big\{\|\xi_n + \eta_n\|^2 \mid \mathcal{F}_{n-1}\big\}
\le 2L_1^2\|\theta_{n-1} - \theta^*\|^2 + 2 E\|\xi_n\|^2. \tag{A.16}
\]
For the cross term in (A.15), under the strong convexity assumption (3.24),
\[
\big\langle \nabla f(\theta_1) - \nabla f(\theta_2),\, \theta_1 - \theta_2\big\rangle \ge \mu\|\theta_1 - \theta_2\|^2,
\]
and noting that $E\{\langle \theta_{n-1} - \theta^*, \zeta_n\rangle \mid \mathcal{F}_{n-1}\} = 0$,
\[
E\big\{\big\langle \theta_{n-1} - \theta^*,\, \nabla f(\theta_{n-1}) + \zeta_n\big\rangle \mid \mathcal{F}_{n-1}\big\}
= \big\langle \theta_{n-1} - \theta^*,\, \nabla f(\theta_{n-1}) - \nabla f(\theta^*)\big\rangle \ge \mu\|\theta_{n-1} - \theta^*\|^2. \tag{A.17}
\]
Combining (A.15)-(A.17),
\[
E\big\{\|\theta_n - \theta^*\|^2 \mid \mathcal{F}_{n-1}\big\} \le \big(1 - 2\mu\ell_n + 2L_1^2\ell_n^2\big)\|\theta_{n-1} - \theta^*\|^2 + 2\ell_n^2 E\|\xi_n\|^2.
\]
Taking expectations on both sides and writing $\delta_n = E\|\theta_n - \theta^*\|^2$ yields
\[
\delta_n \le \big(1 - 2\mu\ell_n + 2L_1^2\ell_n^2\big)\delta_{n-1} + 2\ell_n^2 E\|\xi_n\|^2. \tag{A.18}
\]
Observe that $\mu \le L_1$ and thus $2\mu\ell_n - 2L_1^2\ell_n^2 \le 2L_1\ell_n(1 - L_1\ell_n) \le 1/2$; this ensures that the coefficient in front of $\delta_{n-1}$ is always positive. By (C0), we have
\[
\delta_0 = E\|\theta_0 - \theta^*\|^2 \le \tau_0^2. \tag{A.19}
\]
By (A.18) and (A.19), applying the recursion $n$ times, and recalling that $\max_{1\le i\le n} E\|\xi_i\|^2 \le \tau^2$, we have
\[
\delta_n \le \prod_{k=1}^n\big(1 - 2\mu\ell_k + 2L_1^2\ell_k^2\big)\delta_0 + 2\tau^2\sum_{k=1}^n\prod_{i=k+1}^n\big(1 - 2\mu\ell_i + 2L_1^2\ell_i^2\big)\ell_k^2
\le \prod_{k=1}^n\big(1 - 2\mu\ell_k + 2L_1^2\ell_k^2\big)\tau_0^2 + 2\tau^2\sum_{k=1}^n\prod_{i=k+1}^n\big(1 - 2\mu\ell_i + 2L_1^2\ell_i^2\big)\ell_k^2.
\]
Following the proof of Bach and Moulines [2, Eqs. (18), (23) and (24)], we have
\[
\delta_n \le \Big(\tau_0^2 + \frac{\tau^2}{L_1^2}\Big)\exp\Big(-2\mu\sum_{k=1}^n\ell_k + 4L_1^2\sum_{k=1}^n\ell_k^2\Big) + 2\tau^2\sum_{k=1}^n\prod_{i=k+1}^n(1 - \mu\ell_i)\ell_k^2 \tag{A.20}
\]
\[
\le \Big(\tau_0^2 + \frac{\tau^2}{L_1^2}\Big)\exp\Big(-2\mu\sum_{k=1}^n\ell_k + 4L_1^2\sum_{k=1}^n\ell_k^2\Big)
+ 2\tau^2\Big\{\exp\Big(-\mu\sum_{i=n/2}^n\ell_i\Big)\sum_{k=1}^{n/2}\ell_k^2 + \frac{\ell_{n/2}}{\mu}\Big\}. \tag{A.21}
\]
We next consider the cases $\alpha \in (1/2, 1)$ and $\alpha = 1$ separately. If $\alpha \in (1/2, 1)$, then
\[
\sum_{k=1}^n\ell_k^2 \le \ell_1^2\sum_{k=1}^\infty k^{-2\alpha} \le C,
\]
and, by (A.19) and (A.21),
\[
\delta_n \le C\exp\big(-\mu\ell_1 n^{1-\alpha}\big)\Big(\tau_0^2 + \frac{\tau^2}{L_1^2}\Big) + \frac{4\ell_1\tau^2}{\mu n^\alpha} \le C n^{-\alpha}(\tau_0^2 + \tau^2).
\]
This proves (5.42). Now we move to prove (5.43). For $\alpha = 1$, i.e., $\ell_i = \ell_1 i^{-1}$, we use the following elementary inequalities:
\[
\log n \le \sum_{k=1}^n k^{-1} \le \log n + 1, \qquad \sum_{k=1}^\infty k^{-2} \le 2.
\]
For the first term of (A.20), we have
\[
\exp\Big(-2\mu\sum_{k=1}^n\ell_k + 4L_1^2\sum_{k=1}^n\ell_k^2\Big) \le \exp\big(8\ell_1^2 L_1^2\big)\exp\big(-2\mu\ell_1\log n\big) \le C n^{-2\mu\ell_1},
\]
and for the second term of (A.20), we obtain
\[
\sum_{k=1}^n\prod_{i=k+1}^n(1 - \mu\ell_i)\ell_k^2
\le \ell_1^2\sum_{k=1}^n k^{-2}\exp\Big\{-\mu\ell_1\sum_{i=k+1}^n i^{-1}\Big\}
\le \ell_1^2 e^{\mu\ell_1}\sum_{k=1}^n k^{-2}\exp\big\{-\mu\ell_1\log n + \mu\ell_1\log k\big\}
\]
\[
= \ell_1^2 e^{\mu\ell_1} n^{-\mu\ell_1}\sum_{k=1}^n k^{\mu\ell_1 - 2}
\le \begin{cases} C n^{-1}, & \mu\ell_1 > 1, \\ C n^{-1}\log n, & \mu\ell_1 = 1, \\ C n^{-\mu\ell_1}, & 0 < \mu\ell_1 < 1. \end{cases}
\]
Therefore, for $\alpha = 1$,
\[
\delta_n \le \begin{cases} C n^{-1}(\tau_0^2 + \tau^2), & \mu\ell_1 > 1, \\ C n^{-1}(\log n)(\tau_0^2 + \tau^2), & \mu\ell_1 = 1, \\ C n^{-\mu\ell_1}(\tau_0^2 + \tau^2), & 0 < \mu\ell_1 < 1. \end{cases}
\]
This proves (5.43).
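The recursion (A.18) can be iterated numerically to see the $n^{-\alpha}$ rate of (5.42) directly. In the minimal sketch below, the constants $\mu$, $L_1$, $\tau$, $\tau_0$ and the step-size parameters are assumed values chosen only for illustration.

```python
import numpy as np

mu, L1, tau2, alpha, ell1 = 1.0, 2.0, 1.0, 0.75, 0.5
n = 10 ** 5
delta = 1.0                                   # δ_0 = E‖θ_0 - θ*‖² (assumed τ_0² = 1)
for k in range(1, n + 1):
    lk = ell1 * k ** (-alpha)
    # δ_k ≤ (1 - 2µℓ_k + 2L_1²ℓ_k²)δ_{k-1} + 2ℓ_k²τ², the recursion (A.18)
    delta = (1 - 2 * mu * lk + 2 * L1 ** 2 * lk ** 2) * delta + 2 * lk ** 2 * tau2
print(delta * n ** alpha)                     # δ_n·n^α stays bounded, matching (5.42)
```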
A.9. Proof of Lemma 5.13
By the construction of $(\theta_j^{(i)})_{1\le j\le n}$,
\[
\theta_j - \theta_j^{(i)} = \begin{cases} 0, & j < i; \\ -\ell_j\big(\xi_j - \xi_j' + \eta_j - \eta_j^{(i)}\big), & j = i; \\ \big(\theta_{j-1} - \theta_{j-1}^{(i)}\big) - \ell_j\big(\nabla f(\theta_{j-1}) - \nabla f(\theta_{j-1}^{(i)}) + \eta_j - \eta_j^{(i)}\big), & j > i. \end{cases}
\]
Since $\xi_i \stackrel{d}{=} \xi_i'$, $\eta_i \stackrel{d}{=} \eta_i^{(i)}$ and $\|\eta_i\| \le c_1\|\theta_{i-1} - \theta^*\|$, it follows from (C1) and Lemma 5.12 that
\[
E\|\theta_i - \theta_i^{(i)}\|^2 \le C\ell_i^2\big(E\|\xi_i\|^2 + E\|\eta_i\|^2\big)
\le C i^{-2\alpha}\big(\tau^2 + c_1^2 E\|\theta_{i-1} - \theta^*\|^2\big) \le C i^{-2\alpha}(\tau_0^2 + \tau^2). \tag{A.22}
\]
For $j > i$, using an argument similar to that leading to (A.18),
\[
E\|\theta_j - \theta_j^{(i)}\|^2 = E\|\theta_{j-1} - \theta_{j-1}^{(i)}\|^2
- 2\ell_j E\big\langle \theta_{j-1} - \theta_{j-1}^{(i)},\, \nabla f(\theta_{j-1}) - \nabla f(\theta_{j-1}^{(i)})\big\rangle
+ \ell_j^2 E\big\|\nabla f(\theta_{j-1}) - \nabla f(\theta_{j-1}^{(i)}) + \eta_j - \eta_j^{(i)}\big\|^2
\]
\[
\le \big(1 - 2\mu\ell_j + 2L_1^2\ell_j^2\big) E\|\theta_{j-1} - \theta_{j-1}^{(i)}\|^2.
\]
Solving this recursion, and by (A.22), we have
\[
E\|\theta_j - \theta_j^{(i)}\|^2 \le \exp\Big\{-2\mu\sum_{k=i+1}^j\ell_k + 2L_1^2\sum_{k=1}^j\ell_k^2\Big\} E\|\theta_i - \theta_i^{(i)}\|^2
\le C(\tau_0^2 + \tau^2)\, i^{-2\alpha}\exp\Big\{-2\mu\sum_{k=i+1}^j\ell_k\Big\}.
\]
For $1/2 < \alpha < 1$, using a similar argument as in the proof of Theorem 3.4, we have for $j > i$,
\[
E\|\theta_j - \theta_j^{(i)}\|^2 \le C(\tau_0^2 + \tau^2)\, i^{-2\alpha}\exp\big\{-\mu\big(\varphi_{1-\alpha}(j) - \varphi_{1-\alpha}(i)\big)\big\}
\]
\[
\le \begin{cases} C(\tau_0^2 + \tau^2)\, i^{-2\alpha}, & i \le j \le 2i, \\ C(\tau_0^2 + \tau^2)\, i^{-2\alpha}\exp\big\{-\mu\big(\varphi_{1-\alpha}(j) - \varphi_{1-\alpha}(j/2)\big)\big\}, & j > 2i, \end{cases}
\le \begin{cases} C(\tau_0^2 + \tau^2)\,(j/2)^{-2\alpha}, & i \le j \le 2i, \\ C(\tau_0^2 + \tau^2)\, i^{-2\alpha}\exp\big\{-c\mu\,\varphi_{1-\alpha}(j)\big\}, & j > 2i, \end{cases}
\]
which is bounded by $C(\tau_0^2 + \tau^2)\, j^{-2\alpha}$ in both cases. This proves (5.44) and (5.45). If $\alpha = 1$, (5.46) can be shown similarly.
A.10. Proof of Lemma 5.14

Recall that for any $j \ge 1$,
\[
\theta_j - \theta^* = (\theta_{j-1} - \theta^*) - \ell_j\big(\nabla f(\theta_{j-1}) + \zeta_j\big),
\]
where $\zeta_j$ is a martingale difference such that $E\{\zeta_j \mid \mathcal{F}_{j-1}\} = 0$, and $\theta_{j-1}$ is $\mathcal{F}_{j-1}$-measurable. Hence, expanding the square of $\|\theta_j - \theta^*\|^2$ and using $\langle \theta_{j-1} - \theta^*, \nabla f(\theta_{j-1}) + \zeta_j\rangle^2 \le \|\theta_{j-1} - \theta^*\|^2\|\nabla f(\theta_{j-1}) + \zeta_j\|^2$,
\[
\|\theta_j - \theta^*\|^4 \le \|\theta_{j-1} - \theta^*\|^4 + 6\ell_j^2\|\theta_{j-1} - \theta^*\|^2\|\nabla f(\theta_{j-1}) + \zeta_j\|^2
+ \ell_j^4\|\nabla f(\theta_{j-1}) + \zeta_j\|^4
\]
\[
+ 4\ell_j^3\|\theta_{j-1} - \theta^*\|\,\|\nabla f(\theta_{j-1}) + \zeta_j\|^3
- 4\ell_j\|\theta_{j-1} - \theta^*\|^2\big\langle \theta_{j-1} - \theta^*,\, \nabla f(\theta_{j-1}) + \zeta_j\big\rangle,
\]
and then
\[
E\big\{\|\theta_j - \theta^*\|^4 \mid \mathcal{F}_{j-1}\big\} \le \|\theta_{j-1} - \theta^*\|^4
+ 6\ell_j^2\|\theta_{j-1} - \theta^*\|^2 E\big\{\|\nabla f(\theta_{j-1}) + \zeta_j\|^2 \mid \mathcal{F}_{j-1}\big\}
+ \ell_j^4 E\big\{\|\nabla f(\theta_{j-1}) + \zeta_j\|^4 \mid \mathcal{F}_{j-1}\big\}
\]
\[
+ 4\ell_j^3\|\theta_{j-1} - \theta^*\|\, E\big\{\|\nabla f(\theta_{j-1}) + \zeta_j\|^3 \mid \mathcal{F}_{j-1}\big\}
- 4\ell_j\|\theta_{j-1} - \theta^*\|^2\big\langle \theta_{j-1} - \theta^*,\, \nabla f(\theta_{j-1}) - \nabla f(\theta^*)\big\rangle. \tag{A.23}
\]
Note that by (3.24),
\[
\big\langle \theta_{j-1} - \theta^*,\, \nabla f(\theta_{j-1}) - \nabla f(\theta^*)\big\rangle \ge \mu\|\theta_{j-1} - \theta^*\|^2. \tag{A.24}
\]
For $2 \le p \le 4$, recall that by (C1), $E\{\|\xi_j\|^p\} \le \tau^p$ and $\|\eta_j\| \le c_1\|\theta_{j-1} - \theta^*\|$, and it follows that
\[
E\big\{\|\nabla f(\theta_{j-1}) + \zeta_j\|^p \mid \mathcal{F}_{j-1}\big\}
= E\big\{\|\nabla f(\theta_{j-1}) - \nabla f(\theta^*) + \xi_j + \eta_j\|^p \mid \mathcal{F}_{j-1}\big\}
\le 2^{p-1}\big(L_1^p\|\theta_{j-1} - \theta^*\|^p + \tau^p\big). \tag{A.25}
\]
By (A.23)-(A.25), we have
\[
E\big\{\|\theta_j - \theta^*\|^4 \mid \mathcal{F}_{j-1}\big\}
\le \|\theta_{j-1} - \theta^*\|^4\big(1 - 4\mu\ell_j + 12\ell_j^2 L_1^2 + 16\ell_j^3 L_1^3 + 8\ell_j^4 L_1^4\big)
+ 20\|\theta_{j-1} - \theta^*\|^2\ell_j^2\tau^2 + 16\ell_j^4\tau^4
\]
\[
\le \|\theta_{j-1} - \theta^*\|^4\big(1 - 4\mu\ell_j + 16\ell_j^2 L_1^2 + 24\ell_j^3 L_1^3\big)
+ 20\|\theta_{j-1} - \theta^*\|^2\ell_j^2\tau^2 + 16\ell_j^4\tau^4.
\]
Taking expectations on both sides, we have
\[
E\|\theta_j - \theta^*\|^4 \le E\|\theta_{j-1} - \theta^*\|^4\big(1 - 4\mu\ell_j + 16\ell_j^2 L_1^2 + 24\ell_j^3 L_1^3\big)
+ C(\tau_0^2 + \tau^2)\tau^2 j^{-3\alpha} + 16\ell_1^4\tau^4 j^{-4\alpha},
\]
where we used Lemma 5.12 in the last inequality. Using arguments similar to those leading to Lemma 5.12 (see also Bach and Moulines [2, pp. 16-19]), we have, for $\alpha \in (0, 1)$,
\[
E\|\theta_j - \theta^*\|^4 \le C j^{-2\alpha}(\tau_0^4 + \tau^4).
\]
If $\alpha = 1$, we have
\[
E\|\theta_j - \theta^*\|^4 \le C j^{-2\mu\ell_1}\big(\varphi_{2\mu\ell_1-2}(j) + 1\big)(\tau_0^4 + \tau^4),
\]
where $\varphi$ is as defined in (5.32); in particular, this yields $C j^{-2}(\tau_0^4 + \tau^4)$ for $\mu\ell_1 > 1$ and $C j^{-2}(\log j)(\tau_0^4 + \tau^4)$ for $\mu\ell_1 = 1$. This proves (5.47) and (5.48), and hence the lemma.

References

[1] Anastasiou, A., Balasubramanian, K. and Erdogdu, M. A. (2019). Normal approximation for stochastic gradient descent via non-asymptotic rates of martingale CLT. In Proceedings of the Thirty-Second Conference on Learning Theory (COLT), PMLR 99.
[2] Bach, F. and Moulines, E. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NIPS), Granada, Spain.
[3] Barbour, A. D. (1990). Stein's method for diffusion approximations. Probab. Theory Relat. Fields 84(3) 297-322.
[4] Bentkus, V. (2003). On the dependence of the Berry-Esseen bound on dimension. J. Stat. Plan. Inference 113(2) 385-402.
[5] Bentkus, V. (2005). A Lyapunov-type bound in R^d. Theory Probab. Appl. 49(2) 311-323.
[6] Bentkus, V., Bloznelis, M. and Götze, F. (1997). A Berry-Esséen bound for M-estimators. Scand. J. Stat. 24(4) 485-502.
[7] Bhattacharya, R. and Holmes, S. (2010). An exposition of Götze's estimation of the rate of convergence in the multivariate central limit theorem. Technical report, Stanford University.
[8] Chatterjee, S. and Meckes, E. (2008). Multivariate normal approximation using exchangeable pairs. ALEA Lat. Am. J. Probab. Math. Stat. 4 257-283.
[9] Chen, L. H. Y. and Shao, Q.-M. (2007). Normal approximation for nonlinear statistics using a concentration inequality approach. Bernoulli 13(2) 581-599.
[10] Chen, L. H. Y. and Fang, X. (2015). Multivariate normal approximation by Stein's method: the concentration inequality approach. Available at arXiv:1111.4073.
[11] Chen, L. H. Y., Goldstein, L. and Shao, Q.-M. (2011). Normal Approximation by Stein's Method. Probability and its Applications. Springer, Heidelberg.
[12] Chen, R. Y., Gittens, A. and Tropp, J. A. (2012). The masked sample covariance estimator: an analysis using matrix concentration inequalities. Inf. Inference 1(1) 2-20.
[13] Goldstein, L. and Rinott, Y. (1996). Multivariate normal approximations by Stein's method and size bias couplings. J. Appl. Probab. 33(1) 1-17.
[14] Götze, F. (1991). On the rate of convergence in the multivariate CLT. Ann. Probab. 19(2) 724-739.
[15] Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 221-233. University of California Press.
[16] Nagaev, S. V. (1976). An estimate of the remainder term in the multidimensional central limit theorem. In Proceedings of the Third Japan-USSR Symposium on Probability Theory, Lecture Notes in Mathematics 550, 419-438. Springer, Berlin-Heidelberg.
[17] Nemirovski, A., Juditsky, A., Lan, G. and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4) 1574-1609.
[18] Paulauskas, V. (1996). Rates of convergence in the asymptotic normality for some local maximum estimators. Lith. Math. J. 36(1) 68-91.
[19] Pfanzagl, J. (1971). The Berry-Esseen bound for minimum contrast estimates. Metrika 17(1) 82-91.
[20] Pfanzagl, J. (1972). Further results on asymptotic normality I. Metrika 18(1) 174-198.
[21] Pfanzagl, J. (1973). The accuracy of the normal approximation for estimates of vector parameters. Z. Wahrscheinlichkeitstheorie verw. Geb. 25(3) 171-198.
[22] Pollard, D. (1985). New ways to prove central limit theorems. Econom. Theory 1(3) 295-313.
[23] Polyak, B. T. (1990). New stochastic approximation type procedures. Avtomatika i Telemekhanika (7) 98-107.
[24] Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4) 838-855.
[25] Raič, M. (2019). A multivariate Berry-Esseen theorem with explicit constants. Bernoulli 25(4A) 2824-2853.
[26] Reinert, G. and Röllin, A. (2009). Multivariate normal approximation with Stein's method of exchangeable pairs under a general linearity condition. Ann. Probab. 37(6) 2150-2173.
[27] Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University, School of Operations Research and Industrial Engineering.
[28] Senatov, V. V. (1981). Uniform estimates of the rate of convergence in the multi-dimensional central limit theorem. Theory Probab. Appl. 25(4) 745-759.
[29] Shao, Q.-M. and Zhou, W.-X. (2016). Cramér type moderate deviation theorems for self-normalized processes. Bernoulli 22(4) 2029-2079.
[30] Talagrand, M. (1989). Isoperimetry and integrability of the sum of independent Banach-space valued random variables. Ann. Probab. 17(4) 1546-1570.
[31] Van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
[32] Van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, New York.
[33] Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing. Cambridge University Press.
[34] Wellner, J. A. (2005). Empirical processes: Theory and applications. Lecture notes.