Estimating High-Dimensional Discrete Choice Model of Differentiated Products with Random Coefficients
Masayuki Sawada
Institute of Economic Research, Hitotsubashi University
[email protected]
Kohei Kawaguchi
Department of Economics, Hong Kong University of Science and Technology
[email protected]
April 21, 2020

Abstract
We propose an estimation procedure for discrete choice models of differentiated products with possibly high-dimensional product attributes. In our model, high-dimensional attributes can be determinants of both the mean and the variance of the indirect utility of a product. The key restriction in our model is that the high-dimensional attributes affect the variance of indirect utilities only through finitely many indices. In the framework of the random-coefficients logit model, we show a bound on the error rate of an $\ell_1$-regularized minimum distance estimator and prove the asymptotic linearity of the de-biased estimator.

There are many occasions in which high-dimensional product attributes are available for estimating and predicting demands for differentiated products, especially in contexts where machine learning techniques such as pattern recognition and natural language processing can generate high-dimensional representations of product attributes. For example, consider a consumer's choice over clothes. In addition to the typical characteristics of prices, countries of origin, and materials, there are numerous varieties of design patterns that can influence a consumer's choice. The choice over books depends on their contents, and with a natural language processing technique, we can potentially represent those contents with high-dimensional semantic vectors. In this paper, we investigate a framework that enables us to integrate these potentially informative but not-yet-fully-used product attributes in the estimation, inference, and prediction of demands for differentiated products.

We are not the first to consider the estimation of a high-dimensional discrete choice model. Gillen et al. (2019) studied the estimation of a model that extends the random-coefficients discrete-choice model of Berry et al. (1995, henceforth BLP) with high-dimensional attributes and applied it to the study of political campaigns for elections.
In their model, the product attributes can grow exponentially relative to the sample size; however, a random coefficient is only allowed on the price. Under this assumption, the high-dimensional attributes can only affect the mean indirect utility. Thus, they can apply an $\ell_1$-regularized least squares method to the inverted mean indirect utility to select relevant variables.

Recent methodological developments in statistics and econometrics allow us to apply high-dimensional estimation procedures to a broader class of models. In particular, the recent work of Belloni et al. (2018) provides a set of useful results for $\ell_1$-regularized minimum distance estimation following the techniques developed by Frank and Friedman (1993) and Tibshirani (1996). We derive the properties of our $\ell_1$-regularized minimum distance estimator and its de-biased version based on their results. Because the objective function of the BLP model is non-linear in the parameters, we use the contraction inequality theorem of Ledoux and Talagrand (1991) to control the tail probability of the estimation error. This contraction inequality exploits the Lipschitz continuity of the objective function with respect to a single index. In the context of the BLP discrete choice model with non-linear but smooth moment conditions, we show that the analogue principle can be applied to the case with multiple indices. This allows us to have a fixed number of random coefficients on the indices of potentially high-dimensional attributes.

The rest of the paper proceeds in the following manner. In the next section, we describe our model and introduce the regularized GMM (RGMM) problem for our model. In Section 3, we show the probability bound for the estimation error from the regularized GMM problem. In Section 4, we consider the de-biased procedure for proper inference. The last section concludes.
Consider $J$ products in each market $i \in \{1, \dots, n\}$. For an integer, let $[\,\cdot\,]$ denote the index set $\{1, \dots, \cdot\,\}$. Each product $j$ has a non-zero demand $S_{ij} \in (0, 1)$ in each market $i$. Each product $j$ in a market $i$ has $L$ observed attributes $\{x_{ijl}\}_{l \in [L]}$, including the cost of attaining the product $j$, $-p_{ij}$, and an unobserved attribute $\xi_{ij}$. In this paper, we consider high-dimensionality in the product attributes $x_{ijl}$. The key restriction is that we assume there are $G$ known finite partitions of the high-dimensional product characteristics $[L]$. In particular, we consider the following indirect utility for a product $j$ in a market $i$:
\[
u_{ij} = \sum_{g \in [G]} \sum_{l \in L_g} x_{ijl}(\beta_l + \gamma_l \tilde{\beta}_g) + \xi_{ij} + \epsilon_{ij},
\]
where $\epsilon_{ij}$ is an idiosyncratic error term and the sets $L_g$, $g \in [G]$, are mutually exclusive subsets of $[L]$ such that $\cup_{g \in [G]} L_g = [L]$. Note that $\sum_{g \in [G]} \sum_{l \in L_g} x_{ijl}\beta_l + \xi_{ij} = x_{ij}'\beta + \xi_{ij}$ with $\beta = (\beta_1, \dots, \beta_L)'$ captures the mean utility of the product $j$ in the market $i$. Also, $\sum_{g \in [G]} \sum_{l \in L_g} x_{ijl}\gamma_l \tilde{\beta}_g = \sum_{g \in [G]} x_{ijg}'\gamma_g \tilde{\beta}_g$ with $x_{ijg} = [x_{ijl}]'_{l \in L_g}$, $\gamma_g = [\gamma_l]'_{l \in L_g}$, and $\tilde{\beta}_g \sim \text{iid } N(0, 1)$ captures the individual heterogeneity in the utility for the product $j$ in the market $i$. We assume that $\tilde{\beta}_g$ is independent of $\tilde{\beta}_{g'}$, and the group-specific variance of the random coefficient $\tilde{\beta}_g$ is normalized to 1 for each corresponding vector $\gamma_g$. Therefore, the variance of the indirect utility of the product $j$ in the market $i$ is $\sum_{g \in [G]} (\sum_{l \in L_g} x_{ijl}\gamma_l)^2$. This finite-indices restriction allows us to apply the contraction inequality theorem of Ledoux and Talagrand (1991), which is in principle applied to a Lipschitz transformation of a single index. If the random coefficients are not restricted, then the number of indices grows to infinity as the number of attributes grows to infinity, and we cannot apply the contraction inequality to bound the estimation error.
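Since the $\tilde\beta_g$ are mutually independent with unit variance, the variance expression follows in one line, conditional on the attributes:
\[
\mathrm{Var}\Bigl(\sum_{g \in [G]} x_{ijg}'\gamma_g \tilde\beta_g \Bigm| x_{ij}\Bigr)
= \sum_{g \in [G]} (x_{ijg}'\gamma_g)^2\, \mathrm{Var}(\tilde\beta_g)
= \sum_{g \in [G]} \Bigl(\sum_{l \in L_g} x_{ijl}\gamma_l\Bigr)^2.
\]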
For the remainder of the paper, let $g(l)$ represent the group $g \in [G]$ to which the attribute $l$ belongs.

The mean utility is the same as in a usual differentiated-product demand model such as the BLP model. The heterogeneity term is different from the usual random-coefficient model. This model can be seen as a special case of the usual random-coefficient model in which the product attributes $x_{ijl}$ and $x_{ijl'}$ share the same individual preference shock $\tilde{\beta}_g$ if $l, l' \in L_g$. In other words, consumers observe the set of characteristics $\{x_{ijl}\}_{l \in L_g}$ as a common index characteristic $x_{ijg}'\gamma_g$, but individuals may have different preferences over the index $x_{ijg}'\gamma_g$. In the example of the design pattern, we share the same objective descriptions of the design, but the subjective preferences over the descriptions differ across people. An important feature is that we do not restrict $\beta_l = \gamma_l$. Therefore, the mean $\sum_{l \in L_g} x_{ijl}\beta_l$ and the variance $(\sum_{l \in L_g} x_{ijl}\gamma_l)^2$ of each utility load from a group of characteristics $L_g$ may be unrelated to each other.

Suppose that $\epsilon_{ij}$ follows an iid Type-I extreme value distribution. Then the parameters $\theta \equiv (\beta', \gamma')'$ and observed characteristics $x_{ij}$ pin down the share of the product $j$ in the market $i$ as a function of the mean utility vector $x_i'\beta + \xi_i$ such that
\[
s_j(x_i, x_i'\beta + \xi_i; \theta) = \int \frac{\exp\bigl(x_{ij}'\beta + \xi_{ij} + \sum_g x_{ijg}'\gamma_g \tilde{\beta}_g\bigr)}{1 + \sum_{j' \in [J]} \exp\bigl(x_{ij'}'\beta + \xi_{ij'} + \sum_g x_{ij'g}'\gamma_g \tilde{\beta}_g\bigr)} \, dF(\tilde{\beta}) = S_{ij},
\]
where $F \sim N_{[G]}(0, I_{[G]})$ and $S_{ij}$ denotes the population share of the product $j$ in the market $i$. Here we assume that the unobserved product type $\xi_{ij}$ is mean independent of a vector of instruments $w_{ij}$, $E[\xi_{ij} \mid w_{ij}] = 0$, where the expectation is taken over the markets $i$ for each $j \in [J]$.
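The share integral above has no closed form but is straightforward to approximate by Monte Carlo over draws of $\tilde\beta$. The following is a minimal numerical sketch, not from the paper; all sizes ($J = 3$ products, a two-group partition of $L = 5$ attributes) and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: J products, L attributes split into G = 2 groups, R draws.
J, L, R = 3, 5, 20000
groups = [np.arange(0, 2), np.arange(2, 5)]     # partition {L_1, L_2} of [L]

x = rng.normal(size=(J, L))                     # observed attributes for one market
xi = rng.normal(scale=0.5, size=J)              # unobserved attributes xi_j
beta = rng.normal(size=L)                       # mean-utility coefficients beta
gamma = rng.normal(scale=0.3, size=L)           # index loadings gamma

def shares(x, xi, beta, gamma):
    """Monte Carlo version of s_j = int exp(u_j) / (1 + sum_j' exp(u_j')) dF(beta~)."""
    delta = x @ beta + xi                                        # mean utilities x'beta + xi
    nu = np.column_stack([x[:, g] @ gamma[g] for g in groups])   # J x G indices x'_{jg} gamma_g
    beta_tilde = rng.normal(size=(R, len(groups)))               # beta~_g ~ iid N(0, 1)
    u = delta[None, :] + beta_tilde @ nu.T                       # R x J simulated utilities
    expu = np.exp(u)
    return (expu / (1.0 + expu.sum(axis=1, keepdims=True))).mean(axis=0)

S = shares(x, xi, beta, gamma)   # strictly positive shares summing to less than one
```

With the outside option in the denominator, each simulated share is strictly positive and the inside shares sum to less than one, matching $S_{ij} \in (0, 1)$.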
The conditional moment restriction above leads to a set of unconditional moment conditions with a transformed vector of $K$ instruments for each product $j$, $h_{jk}(w_{ij})$, $k \in [K]$, such that $E[\xi_{ij} h_{jk}(w_{ij})] = 0$ for each $j \in [J]$. (To simplify the argument, we ignore for now the issue of measurement error in the observed market shares relative to the population shares.) Berry (1994) shows that there exists a unique inverse function of the share $s_j(x, \cdot\,; \theta)$ such that
\[
s_j(x_i, s^{-1}(x_i, S_i; \theta); \theta) = S_{ij}. \tag{1}
\]
Therefore, the moment condition is now
\[
E[(s_j^{-1}(x_i, S_i; \theta) - x_{ij}'\beta)\, h_{jk}(w_{ij})] = 0, \quad \forall j \in [J],\ k \in [K].
\]
Now let $\xi_j(\tilde{x}; \theta) \equiv s_j^{-1}(x, S; \theta) - x_{ij}'\beta$, where $\tilde{x} \equiv (x', S', w')'$.

Following Belloni et al. (2018), we consider a regularized GMM approach for the moment condition above. In particular, let $f(\tilde{x}; \theta)$ be the score function vector of $(J, K)$ elements with
\[
f_{jk}(\tilde{X}; \theta) \equiv \xi_j(\tilde{X}; \theta)\, h_{jk}(W)
\]
for each $j \in [J]$ and $k \in [K]$, where $\tilde{X} \equiv (X', S', W')'$, and let $f(\theta) \equiv A E[f(\tilde{X}; \theta)]$ and $\hat{f}(\theta) \equiv \hat{A} E_n[f(\tilde{X}; \theta)]$ with some weight matrix $A$ and its estimate $\hat{A}$, where $E_n[\,\cdot\,]$ represents the sample mean of a random vector. For now, let $A = \hat{A} = I$.

The regularized GMM estimator $\hat\theta$ solves the following optimization problem:
\[
\min_{\theta \in \Theta} \|\theta\|_1 \quad \text{subject to} \quad \|\hat{f}(\theta)\|_\infty \leq \lambda
\]
for some regularization parameter $\lambda$.

Belloni et al. (2018) show the rate of convergence for the estimation error under two conditions in addition to the regularization condition, which is the constraint of the optimization problem shown above. Below we cite their statement under three high-level conditions.
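The inversion in equation (1) is typically computed with the BLP contraction mapping $\delta^{t+1} = \delta^t + \log S - \log s(\delta^t)$ on the mean utility vector. A minimal sketch, with hypothetical dimensions and indices and with the share simulator re-implemented inline:

```python
import numpy as np

rng = np.random.default_rng(1)
J, G, R = 3, 2, 5000
nu = rng.normal(size=(J, G))            # hypothetical group indices x'_{jg} gamma_g
beta_tilde = rng.normal(size=(R, G))    # draws of beta~, held fixed across iterations

def shares_from_delta(delta):
    """Simulated shares s_j(delta) = int exp(delta_j + sum_g nu_jg beta~_g)/(1 + ...) dF."""
    u = delta[None, :] + beta_tilde @ nu.T
    expu = np.exp(u)
    return (expu / (1.0 + expu.sum(axis=1, keepdims=True))).mean(axis=0)

def berry_invert(S, tol=1e-12, max_iter=10000):
    """BLP contraction mapping: delta <- delta + log S - log s(delta)."""
    delta = np.log(S) - np.log(1.0 - S.sum())    # plain-logit starting values
    for _ in range(max_iter):
        delta_new = delta + np.log(S) - np.log(shares_from_delta(delta))
        if np.max(np.abs(delta_new - delta)) < tol:
            return delta_new
        delta = delta_new
    return delta

delta_true = np.array([0.5, -0.2, 0.1])
S = shares_from_delta(delta_true)      # shares generated at a known delta
delta_hat = berry_invert(S)            # recovers delta up to numerical tolerance
```

Because the same simulation draws are used to generate and to invert the shares, the fixed point recovers the generating mean utilities exactly up to the stopping tolerance.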
Proposition 1 (Proposition 3.1 of Belloni et al. (2018)). Assume the following three conditions:

1. (Regularization) The regularization parameter $\lambda$ satisfies $\|\hat{f}(\theta_0)\|_\infty \leq \lambda$ with probability at least $1 - \alpha$.

2. (Identifiability) The population moment function satisfies the following: $\{\|f(\theta) - f(\theta_0)\|_\infty \leq \epsilon,\ \theta \in \mathcal{R}(\theta_0)\}$ implies $\|\theta - \theta_0\|_l \leq r(\epsilon; \theta_0, l)$ for all $\epsilon > 0$, where $\mathcal{R}(\theta_0) \equiv \{\theta \in \Theta : \|\theta\|_1 \leq \|\theta_0\|_1\}$ and $r(\,\cdot\,; \theta_0, l)$ is a weakly increasing rate function depending on the semi-norm $l$.

3. (Empirical moment restriction) The empirical moment function satisfies $\sup_{\theta \in \mathcal{R}(\theta_0)} \|\hat{f}(\theta) - f(\theta)\|_\infty \leq \epsilon_n$ with probability at least $1 - \delta_n$.

Then with probability at least $1 - \alpha - \delta_n$,
\[
\|\hat\theta - \theta_0\|_l \leq r(\lambda + \epsilon_n; \theta_0, l).
\]
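When the score is linear in $\theta$, the RGMM program $\min \|\theta\|_1$ subject to $\|\hat f(\theta)\|_\infty \leq \lambda$ specializes to the Dantzig selector, which can be solved exactly as a linear program. A minimal sketch on simulated data (all sizes and values hypothetical; $\theta = \theta^+ - \theta^-$ with $\theta^+, \theta^- \geq 0$ makes the $\ell_1$ objective linear):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n, p, s = 400, 30, 3                   # hypothetical sizes: n obs, p params, s nonzeros
X = rng.normal(size=(n, p))
theta0 = np.zeros(p)
theta0[:s] = [2.0, -1.5, 1.0]
y = X @ theta0 + 0.1 * rng.normal(size=n)

lam = 0.2                              # regularization level lambda
A = X.T @ X / n                        # Jacobian of the linear score
b = X.T @ y / n

# min sum_l (th+_l + th-_l)  s.t.  ||X'(y - X theta)/n||_inf <= lam, theta = th+ - th-
c = np.ones(2 * p)
A_ub = np.block([[A, -A], [-A, A]])
b_ub = np.concatenate([b + lam, lam - b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
theta_hat = res.x[:p] - res.x[p:]      # sparse estimate concentrated on the true support
```

The stacked constraint rows encode $A\theta - b \leq \lambda$ and $b - A\theta \leq \lambda$ jointly, i.e. the $\ell_\infty$ bound on the empirical score.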
For the second condition, identifiability, Belloni et al. (2018) offer the following sufficient condition.

Assumption 1 (Condition NLID for exactly sparse parameters). Suppose that there exists $T \subset [L]$ with cardinality $s$ such that $\theta_{0l} \neq 0$ only for $l \in T$. For each $q \in \{1, 2\}$, suppose that there exists a sequence $\mu_n$ such that
\[
k(\theta_0, \ell_q) \equiv \inf_{\theta \in \mathcal{R}(\theta_0):\, \|\theta - \theta_0\|_q > 0} \frac{\|G(\theta - \theta_0)\|_\infty}{\|\theta - \theta_0\|_q} \geq s^{-1/q}\mu_n,
\]
where $G$ is the Jacobian matrix of $f(\theta_0)$. Suppose further that $\{\|f(\theta) - f(\theta_0)\|_\infty \leq \epsilon,\ \theta \in \mathcal{R}(\theta_0)\}$ implies $\|G(\theta - \theta_0)\|_\infty/2 \leq \epsilon$ for all $\epsilon \leq \epsilon^*$ for some $\epsilon^* > 0$.

The last condition in Assumption 1 is specific to the non-linear problem. Nevertheless, this assumption does not bind in our model because our target moment function is continuously differentiable everywhere.

The second condition in Assumption 1 regulates the modulus of continuity $k(\theta_0, l)$. Lemma 3.1 of Belloni et al. (2018) offers a sufficient condition for the second condition in the exactly sparse model. For the linear IV regression model, for any sub-vector of covariates $X$, we need some sub-vector of instruments $W$ such that $E[W'X]$ is non-singular. In other words, there exist some instruments that are strong for any sub-vector of endogenous covariates. In our context of the BLP model, the Jacobian is a $JK \times 2L$ matrix with the $l$-th entry of the $jk$-th element
\[
G_{jk,l}(\tilde{X}, \theta) =
\begin{cases}
E[h_{jk}(W) X_{jl}] & \text{for } 1 \leq l \leq L, \\
E\Bigl[h_{jk}(W)\, D_j(\theta) \displaystyle\int \tilde{\beta}_{g(l)}\, s(\tilde{\beta}; \tilde{X}, \theta) \sum_{j'=1}^{J} (1 - s_{j'}(\tilde{\beta}; \tilde{X}, \theta)) X_{j'l}\, dF(\tilde{\beta})\Bigr] & \text{for } L + 1 \leq l \leq 2L,
\end{cases}
\]
where $s(\tilde{\beta}; \tilde{X}, \theta)$ is the $J \times 1$ vector with elements
\[
s_j(\tilde{\beta}; \tilde{X}, \theta) \equiv \frac{\exp\bigl(X_j'\beta + \xi_j(\tilde{X}; \theta) + \sum_g X_{jg}'\gamma_g\tilde{\beta}_g\bigr)}{1 + \sum_{j' \in [J]} \exp\bigl(X_{j'}'\beta + \xi_{j'}(\tilde{X}; \theta) + \sum_g X_{j'g}'\gamma_g\tilde{\beta}_g\bigr)},
\]
and $D_j(\theta)$ is the $j$-th row of the inverse of the Jacobian matrix of the vector $s(\tilde{X}; \theta) \equiv s(X, X'\beta + \xi(\tilde{X}; \theta); \theta)$ with respect to the mean utility vector.
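The rows $D_j(\theta)$ are computable directly: for the logit mixture, the Jacobian of the aggregate shares with respect to the mean utilities has the familiar form $\int s_j(\tilde\beta)(1\{j = j'\} - s_{j'}(\tilde\beta))\,dF(\tilde\beta)$, which can be simulated and inverted. A minimal numerical sketch with hypothetical indices $\nu$:

```python
import numpy as np

rng = np.random.default_rng(4)
J, G, R = 3, 2, 4000
nu = rng.normal(size=(J, G))            # hypothetical group indices x'_{jg} gamma_g
beta_tilde = rng.normal(size=(R, G))

def share_draws(delta):
    """R x J matrix of per-draw choice probabilities s_j(beta~)."""
    u = delta[None, :] + beta_tilde @ nu.T
    expu = np.exp(u)
    return expu / (1.0 + expu.sum(axis=1, keepdims=True))

def share_jacobian(delta):
    """Jacobian of aggregate shares w.r.t. mean utilities:
    ds_j/ddelta_j' = int s_j(beta~) (1{j = j'} - s_j'(beta~)) dF(beta~)."""
    s = share_draws(delta)
    jac = -np.einsum('rj,rk->jk', s, s) / R     # -E[s_j s_j'] part
    jac[np.diag_indices_from(jac)] += s.mean(axis=0)
    return jac

delta = np.array([0.2, -0.1, 0.4])
D = np.linalg.inv(share_jacobian(delta))   # its rows are the D_j(theta) above
```

The diagonal of the Jacobian is positive and the off-diagonals negative, so the matrix is diagonally dominant and safely invertible at interior shares.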
Therefore, the modulus of continuity condition requires that the variables $h_{jk}(W)$ serve as strong instruments for the attribute $l$ of the product $j$, $X_{jl}$, as well as for the weighted sum of the attributes $l$ of the products $j'$ across the market, $X_{j'l}$. This is not a strong restriction for most of the attributes, as we often assume the attributes are exogenous. For the endogenous attributes, we need to be cautious about the restriction, as the instruments are not necessarily strong, in particular when the asymptotics are considered in the number of products $J$ rather than the number of markets $n$. See Armstrong (2016) for the relevant discussion.

Lemma 1 (Lemma 3.4 of Belloni et al. (2018)). Under Assumption 1, for all $0 < \epsilon \leq \epsilon^*$, $\{\|f(\theta) - f(\theta_0)\|_\infty \leq \epsilon,\ \theta \in \mathcal{R}(\theta_0)\}$ implies
\[
\|\theta - \theta_0\|_{\ell_q} \leq r(\epsilon; \theta_0, \ell_q) \leq 2\epsilon s^{1/q}\mu_n^{-1}.
\]

Now, let $\Delta\theta \equiv \theta - \theta_0$ for any $\theta, \theta_0 \in \Theta$, where $\bar\Theta$ is the subset of $\mathbb{R}^{2L}$ defined such that $\Delta\theta \in \bar\Theta$. Next, we consider the bound for the tail probability of the estimation error process
\[
\sup_{\Delta\theta \in \bar\Theta,\, j \in [J],\, k \in [K]} \bigl| \mathbb{G}_n\bigl(f_{jk}(\tilde{X}; \theta_0 + \Delta\theta) - f_{jk}(\tilde{X}; \theta_0)\bigr) \bigr|,
\]
where $\mathbb{G}_n X$ is the empirical process of the sequence $\{X_i\}_{i \in [n]}$.

For the rest of the discussion, we introduce the following indices. For each $g \in \{0, 1, \dots, G\}$ and $j \in [J]$, let $X_{j,g}$ be the sub-vector of $X_j$ corresponding to the $g$-th partition of $[L]$, with $X_{j,0} \equiv X_j$, and let $\theta_g$ and $\nu_{jg}$ be defined as
\[
\nu_{jg} \equiv X_{j,g}\theta_g \equiv
\begin{cases}
X_j'\beta & \text{if } g = 0, \\
X_{j,g}'\gamma_g & \text{if } g > 0,
\end{cases}
\]
and let $\xi_j(\nu; \tilde{X}) \equiv \xi_j(\tilde{X}; \theta)$ such that $\nu_{j'g} = X_{j',g}\theta_g$ for every $j' \in [J]$ and $g \in \{0, 1, \dots, G\}$. Also, let
\[
\nu_{0j'g} \equiv X_{j',g}\theta_{0g} \equiv
\begin{cases}
X_{j'}'\beta_0 & \text{if } g = 0, \\
X_{j',g}'\gamma_{0g} & \text{if } g > 0.
\end{cases}
\]
In Lemma 4 in the Appendix, we show that the score functions $f_{jk}$ are Lipschitz continuous in $\nu_{j'g}$ uniformly over $\nu_{-j',-g}$, with a Lipschitz constant of $J$ times some universal constant. Using this property, we employ the Ledoux-Talagrand contraction inequality as follows:

Theorem 1.
In addition to the assumptions for Lemma 4 in the Appendix, suppose the following:

1. $\sup_{\Delta\theta \in \bar\Theta,\, j \in [J],\, k \in [K]} E_n \mathrm{Var}\bigl(f_{jk}(\tilde{X}, \theta_0 + \Delta\theta) - f_{jk}(\tilde{X}, \theta_0)\bigr) \leq B_{1n}$, and

2. $\max_{j \in [J], l \in [L], k \in [K]} E_n(X_{jl} h_{jk}(W))^2 \leq B_{2n}$, and $\|n^{-1/2}\mathbb{G}_n(f(\tilde{X}, \theta_0))\|_\infty \leq n^{-1/2} l_n$ with probability at least $1 - \delta_n/2$.

Then
\[
\sup_{\theta \in \mathcal{R}(\theta_0)} \|\hat{f}(\theta) - f(\theta)\|_\infty \leq n^{-1/2}(\tilde{l}_n + l_n)
\]
with probability at least $1 - \delta_n$, where $\tilde{l}_n \equiv C\bigl(B_{1n} + (J^2 G)(2\sqrt{B_{2n}} \sup_\theta \|\theta\|_1 \log^{1/2}(8 J^2 G K L/\delta_n))\bigr)$ with a universal constant $C$.

Proof. By the same argument as Theorem 3.2 of Belloni et al. (2018), $\|n^{-1/2}\mathbb{G}_n(f(\tilde{X}, \theta_0))\|_\infty \leq n^{-1/2} l_n$ implies that we only need to bound the following empirical process:
\[
\max_{j \in [J], k \in [K]} \sup_{\Delta\theta} \bigl| \mathbb{G}_n\bigl(f_{jk}(\tilde{X}; \theta_0 + \Delta\theta) - f_{jk}(\tilde{X}; \theta_0)\bigr) \bigr|.
\]
By taking $t \geq B_{1n}$, we may apply the Chebyshev inequality and the symmetrization lemma (Lemma 2.3.7 of van der Vaart and Wellner, 1996) so that
\[
P\Bigl( \max_{j, k} \sup_{\Delta\theta \in \bar\Theta} \bigl| \mathbb{G}_n\bigl(f_{jk}(\tilde{X}; \theta_0 + \Delta\theta) - f_{jk}(\tilde{X}; \theta_0)\bigr) \bigr| > t \Bigr)
\leq 4 P\Bigl( \max_{j, k} \sup_{\Delta\theta \in \bar\Theta} \bigl| \mathbb{G}_n \sigma \bigl(f_{jk}(\tilde{X}; \theta_0 + \Delta\theta) - f_{jk}(\tilde{X}; \theta_0)\bigr) \bigr| > t/4 \Bigr),
\]
where $\sigma$ is an iid Rademacher variable taking $-1$ and $1$ with equal probability, independent of all the others. Following step 1 of Lemma D.3 of Belloni et al. (2018), by conditioning on $\Omega_n \equiv \{\max_{j, l, k} E_n(X_{jl}h_{jk}(W))^2 \leq B_{2n}\}$, we consider bounding the tail probability conditional on the event $\Omega_n$ and $\tilde{X}$:
\[
P\Bigl( \max_{j, k} \sup_{\Delta\theta \in \bar\Theta} \bigl| \mathbb{G}_n \sigma \bigl(f_{jk}(\tilde{X}; \theta_0 + \Delta\theta) - f_{jk}(\tilde{X}; \theta_0)\bigr) \bigr| > t/4 \,\Big|\, \Omega_n, \tilde{X} \Bigr).
\]
From now on, we omit the conditioning for notational simplicity.
By the Markov inequality, we have
\[
P\Bigl( \max_{j, k} \sup_{\Delta\theta \in \bar\Theta} \bigl| \mathbb{G}_n \sigma \bigl(f_{jk}(\tilde{X}; \theta_0 + \Delta\theta) - f_{jk}(\tilde{X}; \theta_0)\bigr) \bigr| > t/4 \Bigr)
\leq \frac{E_\sigma \exp\Bigl( \phi \max_{j, k} \sup_{\Delta\theta \in \bar\Theta} \bigl| \mathbb{G}_n \sigma \bigl(f_{jk}(\tilde{X}; \theta_0 + \Delta\theta) - f_{jk}(\tilde{X}; \theta_0)\bigr) \bigr| \Bigr)}{\exp(t\phi/4)},
\]
where $\phi \equiv t/(16 J^2 G B_{2n} \sup_\theta \|\theta\|_1^2)$.

By the mean value theorem, there exists a mean value vector $\tilde\theta$ as a function of $\Delta\theta$, and its corresponding index vector $\tilde\nu$ as a function of $\Delta\nu \equiv \nu - \nu_0$, given the fixed matrix $X$, such that
\[
f_{jk}(\tilde{X}; \theta_0 + \Delta\theta) - f_{jk}(\tilde{X}; \theta_0) = \sum_{g=0}^{G} \sum_{j' \in [J]} \frac{d\xi_j(\tilde\nu; \tilde{X})}{d\nu_{j'g}} (\nu_{j'g} - \nu_{0j'g})\, h_{jk}(W).
\]
Let $\mathcal{N}$ be the support of $\nu$ given the conditioning $\tilde{X}$ and the parameter space $\bar\Theta$. Therefore, we have
\[
\begin{aligned}
&\frac{E_\sigma \exp\Bigl( \phi \max_{j, k} \sup_{\Delta\theta \in \bar\Theta} \bigl| \mathbb{G}_n \sigma \bigl(f_{jk}(\tilde{X}; \theta_0 + \Delta\theta) - f_{jk}(\tilde{X}; \theta_0)\bigr) \bigr| \Bigr)}{\exp(t\phi/4)} \\
&\leq \frac{E_\sigma \exp\Bigl( \phi \max_{j, k} \sup_{\Delta\nu \in \mathcal{N}} \bigl| \mathbb{G}_n \sigma \sum_{g=0}^{G} \sum_{j' \in [J]} \tfrac{d\xi_j(\tilde\nu; \tilde{X})}{d\nu_{j'g}} (\nu_{j'g} - \nu_{0j'g}) h_{jk}(W) \bigr| \Bigr)}{\exp(t\phi/4)} \\
&\leq \frac{E_\sigma \exp\Bigl( \phi \sum_{g=0}^{G} \sum_{j' \in [J]} \max_{j, j' \in [J], g \in [G], k \in [K]} \sup_{\Delta\nu \in \mathcal{N}} \bigl| \mathbb{G}_n \sigma \tfrac{d\xi_j(\tilde\nu; \tilde{X})}{d\nu_{j'g}} (\nu_{j'g} - \nu_{0j'g}) h_{jk}(W) \bigr| \Bigr)}{\exp(t\phi/4)} \\
&\leq \frac{E_\sigma \exp\Bigl( 2JG\phi \max_{j, j' \in [J], g \in [G], k \in [K]} \sup_{\Delta\nu \in \mathcal{N}} \bigl| \mathbb{G}_n \sigma \tfrac{d\xi_j(\tilde\nu; \tilde{X})}{d\nu_{j'g}} (\nu_{j'g} - \nu_{0j'g}) h_{jk}(W) \bigr| \Bigr)}{\exp(t\phi/4)} \\
&\leq J^2 G K \max_{j, j' \in [J], g \in [G], k \in [K]} \frac{E_\sigma \exp\Bigl( 2JG\phi \sup_{\Delta\nu \in \mathcal{N}} \bigl| \mathbb{G}_n \sigma \tfrac{d\xi_j(\tilde\nu; \tilde{X})}{d\nu_{j'g}} (\nu_{j'g} - \nu_{0j'g}) h_{jk}(W) \bigr| \Bigr)}{\exp(t\phi/4)}.
\end{aligned}
\]
The first inequality follows from the mean value theorem and the fact that the supremum over $\Delta\theta$ is dominated by the supremum over $\Delta\nu$, which are constrained conditional on each realization of the matrix $X$. The second inequality follows from the triangle inequality. The third inequality follows from the union bound over $j', g$ by taking the maximum over the $j', g$ indices.
Finally, we take the union bound over the $j, j', g, k$ indices.

To apply the Ledoux-Talagrand contraction inequality (Theorem 4.12, Ledoux and Talagrand, 1991) in bounding the term
\[
\exp\Bigl( 2JG\phi \sup_{\Delta\nu \in \mathcal{N}} \Bigl| \sum_{i \in [n]} \sigma_i \frac{d\xi_{ij}(\tilde\nu_i; \tilde{X}_i)}{d\nu_{j'g}} (\Delta\nu_{i,j'g})\, h_{jk}(W_i) \Bigr| \Bigr),
\]
let
\[
\psi_{jkj'g,i}(\Delta\nu_{ij'g}) \equiv \frac{d\xi_j(\tilde\nu_i; \tilde{X}_i)}{d\nu_{j'g}} \Delta\nu_{ij'g}\, h_{jk}(W_i);
\]
then by Lemma 4, we have $|\psi_{jkj'g,i}(\Delta\nu_{ij'g})| \leq C J |\Delta\nu_{ij'g}|$ uniformly over $\tilde\nu_i$. Since $\psi_{jkj'g,i}(0) = 0$, the Ledoux-Talagrand contraction inequality via Corollary 3 applies so that
\[
\begin{aligned}
&J^2 G K \max_{j', j \in [J], k \in [K], g \in [G]} \frac{E_\sigma \exp\Bigl( 2JG\phi \sup_{\Delta\nu_{j'g} \in \mathcal{N}} \bigl| \mathbb{G}_n \sigma \bigl( \tfrac{d\xi_j(\tilde\nu; \tilde{X})}{d\nu_{j'g}} \bigr) \Delta\nu_{j'g}\, h_{jk}(W) \bigr| \Bigr)}{\exp(t\phi/4)} \\
&\leq J^2 G K \max_{j', j \in [J], k \in [K], g \in [G]} \frac{E_\sigma \exp\bigl( C J^2 G \phi \sup_{\Delta\nu_{j'g} \in \mathcal{N}} | \mathbb{G}_n \sigma \Delta\nu_{j'g}\, h_{jk}(W) | \bigr)}{\exp(t\phi/4)} \\
&\leq J^2 G K \max_{j', j \in [J], k \in [K], g \in [G]} \frac{E_\sigma \exp\bigl( C J^2 G \phi \sup_{\Delta\theta_g \in \bar\Theta_g} | \mathbb{G}_n \sigma h_{jk}(W) X_{j',g} \Delta\theta_g | \bigr)}{\exp(t\phi/4)}.
\end{aligned}
\]
By the Hölder inequality, we have
\[
\begin{aligned}
&J^2 G K \max_{j', j \in [J], k \in [K], g \in [G]} \frac{E_\sigma \exp\bigl( C J^2 G \phi \sup_{\Delta\theta_g \in \bar\Theta_g} | \mathbb{G}_n \sigma h_{jk}(W) X_{j',g} \Delta\theta_g | \bigr)}{\exp(t\phi/4)} \\
&\leq J^2 K G \max_{j', j \in [J], k \in [K]} \frac{E_\sigma \exp\bigl( C J^2 G \phi \max_{l \in [L]} | \mathbb{G}_n \sigma h_{jk}(W) X_{j',l} | \sup_{\Delta\theta} \|\Delta\theta\|_1 \bigr)}{\exp(t\phi/4)} \\
&\leq J^2 G K L \max_{j, j' \in [J], k \in [K], l \in [L]} \frac{E_\sigma \exp\bigl( C J^2 G \phi\, | \mathbb{G}_n \sigma h_{jk}(W) X_{j',l} | \sup_{\Delta\theta} \|\Delta\theta\|_1 \bigr)}{\exp(t\phi/4)} \\
&\leq 2 J^2 G K L \max_{j, j' \in [J], k \in [K], l \in [L]} \frac{\exp\bigl( C^2 J^4 G^2 \phi^2\, E_n(h_{jk}(W) X_{j',l})^2 \sup_{\Delta\theta \in \bar\Theta} \|\Delta\theta\|_1^2 \bigr)}{\exp(t\phi/4)} \\
&\leq 2 J^2 G K L\, \frac{\exp\bigl( C^2 J^4 G^2 \phi^2 B_{2n} \sup_{\theta \in \Theta} \|\theta\|_1^2 \bigr)}{\exp(t\phi/4)}
\end{aligned}
\]
from the symmetry of the distribution and sub-Gaussianity. Then the stated bound is achieved by following the analogous argument of Lemma D.3 of Belloni et al. (2018).

The following statement then shows the error rate of the RGMM BLP estimator:

Theorem 2.
Suppose Assumption 1 and the assumptions for Theorem 1 hold. Assume further that $\|\hat{f}(\theta_0)\|_\infty \leq \lambda$ with probability at least $1 - \alpha$. Then for each $q \in \{1, 2\}$ we have
\[
\|\hat\theta - \theta_0\|_q \leq 2 s^{1/q} \mu_n^{-1} n^{-1/2} (\tilde{l}_n + l_n)
\]
with probability at least $1 - \alpha - \delta_n$, where $\tilde{l}_n \equiv C\bigl(B_{1n} + (J^2 G)(2\sqrt{B_{2n}} \sup_\theta \|\theta\|_1 \log^{1/2}(8 J^2 G K L/\delta_n))\bigr)$ with a universal constant $C$.

Given the RGMM estimator $\hat\theta$, it is recommended that we update the estimate in order to conduct proper inference. The de-biased Lasso, or de-biased RGMM, procedure in Belloni et al. (2018) takes the following steps:

1. Estimate the RGMM estimator $\hat\theta$.

2. Estimate the plug-in gradient $\hat{G} = \partial_{\theta'} \hat{f}(\hat\theta)$ and the plug-in variance-covariance matrix $\hat\Omega = E_n f(\tilde{X}; \hat\theta) f(\tilde{X}; \hat\theta)'$.

3. Solve the minimization problem
\[
\min_{\gamma \in \mathbb{R}^{2L \times JK}} \sum_{l \in [2L]} \|\gamma_l\|_1 \quad \text{subject to} \quad \|\gamma_l \hat\Omega - (\hat{G}')_l\|_\infty \leq \lambda_{\gamma l}
\]
for some regularization parameters $\lambda_{\gamma l}$.

4. Solve the minimization problem
\[
\min_{\mu \in \mathbb{R}^{2L \times 2L}} \sum_{j \in [2L]} \|\mu_j\|_1 \quad \text{subject to} \quad \|\mu_j \hat\gamma \hat{G} - e_j'\|_\infty \leq \lambda_{\mu j}
\]
for some regularization parameters $\lambda_{\mu j}$, where $e_j$ is a coordinate vector with 1 in the $j$-th position and 0 elsewhere.

5. Update the RGMM estimator as $\hat\theta - \hat\mu \hat\gamma \hat{f}(\hat\theta)$.

First of all, we need to provide maximal inequalities for the auxiliary estimators $\hat\gamma$ and $\hat\mu$. The strategy follows the parallel argument of the maximal inequality for $\hat\theta$. Therefore, we need the following modulus of continuity conditions for $\gamma$ and $\mu$.

Assumption 2.
Suppose that there exists a sequence $\mu_n$ such that
\[
\inf_{\gamma \in \mathcal{R}(\gamma_0):\, \|\gamma - \gamma_0\|_1 > 0} \frac{\|(\gamma - \gamma_0)\Omega\|_\infty}{\|\gamma - \gamma_0\|_1} \geq s^{-1}\mu_n
\quad \text{and} \quad
\inf_{\mu \in \mathcal{R}(\mu_0):\, \|\mu - \mu_0\|_1 > 0} \frac{\|(\mu - \mu_0) G'\Omega^{-1} G\|_\infty}{\|\mu - \mu_0\|_1} \geq s^{-1}\mu_n.
\]

Note that the first condition requires that the variance matrices constructed from any elements of the score functions $f_{jk}(X, \theta)$ are non-singular, and the second condition follows if all eigenvalues of $G'\Omega^{-1}G$ are bounded away from zero in absolute value uniformly over $n$.

Then, given a choice of penalty parameters $\lambda_{\gamma l}$ and $\lambda_{\mu j}$, we achieve the maximal inequality for the $\ell_1$-norms of $\hat\gamma_l$ and $\hat\mu_j$.

Lemma 2 (Lemma 3.7 of Belloni et al. (2018)). Let $l^\Omega_n$ and $l^G_n$ be such that $n^{1/2}\|\hat\Omega - \Omega\|_\infty \leq l^\Omega_n$ and $n^{1/2}\|\hat{G} - G\|_\infty \leq l^G_n$ with probability $1 - \delta_n$. Suppose $\max_{l \in [2L]} \|\gamma_l\|_1 \leq \bar{C}$ and $\max_{j \in [2L]} \|\mu_j\|_1 \leq \bar{C}$. Let $\lambda_{\gamma l}$ satisfy
\[
n^{1/2}\lambda_{\gamma l} \geq \bar{C} l^\Omega_n + l^G_n, \qquad \lambda_{\gamma l} \leq n^{-1/2} l_n,
\]
and let $\lambda_{\mu j}$ satisfy
\[
n^{1/2}\lambda_{\mu j} \geq C l^G_n + \bar{C}^2 l^\Omega_n + \bar{C} \max_{l \in [2L]} n^{1/2}\lambda_{\gamma l}, \qquad \lambda_{\mu j} \leq n^{-1/2} l'_n,
\]
for $l \in [2L]$ and $j \in [2L]$. Suppose Assumption 2 holds. Then with probability $1 - \delta_n$, we have
\[
\max_{l \in [2L]} \|\hat\gamma_l - \gamma_l\|_1 \leq \frac{s\, l_n (2 + \bar{C})}{\mu_n \sqrt{n}},
\]
and with probability $1 - \delta_n$,
\[
\max_{j \in [2L]} \|\hat\mu_j - \mu_j\|_1 \leq \frac{s\, l'_n (2 + \bar{C})}{\mu_n \sqrt{n}}.
\]

Next, we need maximal inequalities for the norms $\|\hat{G} - G\|_\infty$, $\|\hat{G} - \tilde{G}\|_\infty$, and $\|\hat\Omega - \Omega\|_\infty$, where $\tilde{G} \equiv \hat{G}(\tilde\theta)$ with $\tilde\theta$ an intermediate value between $\hat\theta$ and $\theta_0$. Unlike Lemma 3.7 of Belloni et al. (2018), which assumes the tail probability bound for the processes $\hat\Omega - \Omega$ and $\hat{G} - G$, we need a certain modification of the lemma, as we did for Theorem 1.

Lemma 3.
Suppose the following:

1. $\max_{j \in [J], k \in [K], l \in [L]} E\Bigl[ h_{jk}(W) \max_{j'' \in [J]} X_{j'',l} \max_{j' \in [J], l' \in [L]} X_{j',l'} \Bigr] \leq C$.

2. With probability $1 - \delta_n$, we have
\[
\max_{j \in [J], k \in [K], l \in [L]} E_n\Bigl[ h_{jk}(W) \max_{j'' \in [J]} X_{j'',l} \max_{j' \in [J], l' \in [L]} X_{j',l'} \Bigr] \leq B_{1n},
\qquad
\max_{j \in [J], k \in [K], l \in [L]} E_n[h_{jk}(W) X_{jl}]^2 \leq B_{2n},
\]
and
\[
\max_{j, j' \in [J], k, k' \in [K], l \in [L]} E_n[h_{jk}(W) h_{j'k'}(W) X_{jl}\, \xi_{j'}(\tilde{X}; \theta_0)]^2 \leq B_{2n}.
\]

3. With probability $1 - \delta_n$, we have $\|\hat\theta - \theta_0\|_q \leq \Delta_{qn}$ for $q \in \{1, 2\}$.

4. $\max_{j \in [J], k \in [K]} E[f_{jk}(\tilde{X}; \theta_0)]^2 \leq C$, and $n^{-1/2} E\bigl[\max_{i \in [n]} \|f(\tilde{X}_i; \theta_0)\|_\infty^2\bigr] \leq \min\{\delta_n, \log^{-1/2}(JK)\}$.

5. $\max_{j \in [J], k \in [K], l \in [2L]} E[G_{jk,l}(\tilde{X}; \theta_0)]^2 \leq C$, and $n^{-1/2} E\bigl[\max_{i \in [n]} \|G(\tilde{X}_i; \theta_0)\|_\infty^2\bigr] \leq \min\{\delta_n, \log^{-1/2}(2JKL)\}$.

Then with probability $1 - C'\delta_n$ we have
\[
\|\hat{G} - G\|_\infty \leq C'\sqrt{n^{-1}\log(2JKL)} + C' J^2 G B_{1n} \Delta_{1n} \sqrt{n^{-1}\log(2J^2GKL/\delta_n)} + C' J^{1/2} \Delta_{1n},
\]
\[
\|\hat{G} - \tilde{G}\|_\infty \leq C'\sqrt{n^{-1}\log(2JKL)} + C' J^2 G B_{1n} \Delta_{1n} \sqrt{n^{-1}\log(2J^2GKL/\delta_n)} + C' J^{1/2} \Delta_{1n},
\]
and
\[
\|\hat\Omega - \Omega\|_\infty \leq C'\sqrt{n^{-1}\log(JK)} + C' J^2 G B_{1n} \Delta_{1n} \sqrt{n^{-1}\log(2J^2GKL/\delta_n)} + 2 J^2 C (J^{1/2}\Delta_{1n} + \Delta_{2n}^2).
\]

Proof.
We follow the proof of Lemma 3.9 of Belloni et al. (2018). First, we bound $\|\hat{G} - G\|_\infty$. Note that $\sqrt{n}\|\hat{G} - G\|_\infty$ is bounded by the sum of the following three terms by the triangle inequality:

(1.1) $\max_{j \in [J], k \in [K], l \in [2L]} |\mathbb{G}_n(G_{jk,l}(\tilde{X}; \hat\theta) - G_{jk,l}(\tilde{X}; \theta_0))|$

(1.2) $\max_{j \in [J], k \in [K], l \in [2L]} |\mathbb{G}_n G_{jk,l}(\tilde{X}; \theta_0)|$

(1.3) $\max_{j \in [J], k \in [K], l \in [2L]} n^{1/2} |E[G_{jk,l}(\tilde{X}; \hat\theta) - G_{jk,l}(\tilde{X}; \theta_0)]|$

From Lemma C.1(4) of Belloni et al. (2018), the second term, (1.2), is bounded by
\[
\begin{aligned}
\max_{j, k, l} |\mathbb{G}_n G_{jk,l}(\tilde{X}; \theta_0)|
&\leq C \max_{j, k, l} \Bigl( \frac{1}{n} \sum_i E\bigl[|G_{jk,l}(\tilde{X}_i; \theta_0)|^2\bigr] \Bigr)^{1/2} \log^{1/2}(2JKL) \\
&\quad + n^{-1/2} C \Bigl\{ E\Bigl[\max_{i \leq n} \|G(\tilde{X}_i; \theta_0)\|_\infty^2\Bigr] \bigl(\delta_n^{-1} + \delta_n^{-1/2} + \log(2JKL)\bigr) \Bigr\}^{1/2}
\leq C \log^{1/2}(2JKL)
\end{aligned}
\]
with probability $1 - \delta_n$ by condition 5.

For the last term, (1.3), we use the linear expansion of $G_{jk,l}(\tilde{X}; \theta)$ into $\sum_{j' \in [J], g \in [G]} h_{jk}(W) B_{j',g}(X_l)(\nu_{j'g} - \nu_{0j'g})$ from Lemma 6, so that
\[
\begin{aligned}
&\max_{j, k, l}\, n^{1/2} \bigl| E[G_{jk,l}(\tilde{X}, \hat\theta) - G_{jk,l}(\tilde{X}, \theta_0)] \bigr| \\
&\leq_{(1)} \max_{j, k, l}\, n^{1/2} \Bigl| E \sum_{j' \in [J], g \in [G]} h_{jk}(W) B_{j',g}(X_l)(\nu_{j'g} - \nu_{0j'g}) \Bigr| \\
&\leq_{(2)} \max_{j, k, l}\, n^{1/2} \Bigl| E \sum_{j' \in [J], g \in [G]} h_{jk}(W) B_{j',g}(X_l) X_{j'g}'(\theta_g - \theta_{0g}) \Bigr| \\
&\leq_{(3)} \max_{j, k, l}\, n^{1/2} \Bigl| E \sum_{j' \in [J], l' \in [L]} h_{jk}(W) B_{j',g(l')}(X_l) X_{j',l'} (\theta_{l'} - \theta_{0l'}) \Bigr| \\
&\leq_{(4)} \max_{j, k, l}\, n^{1/2} \Bigl| E\, h_{jk}(W) \max_{j' \in [J], l' \in [L]} [B_{j',g(l')}(X_l) X_{j',l'}] \sum_{l' \in [L]} (\theta_{l'} - \theta_{0l'}) \Bigr| \\
&\leq_{(5)} \max_{j, k, l}\, n^{1/2} \Bigl| E\, h_{jk}(W)\, J \bar{C} \max_{j'' \in [J]} X_{j'',l} \max_{j' \in [J], l' \in [L]} X_{j',l'} \sum_{l' \in [L]} |\theta_{l'} - \theta_{0l'}| \Bigr| \\
&\leq_{(6)} \max_{j, k, l}\, n^{1/2}\, \bar{C} J^{1/2}\, E\Bigl[ h_{jk}(W) \max_{j'' \in [J]} X_{j'',l} \max_{j' \in [J], l' \in [L]} X_{j',l'} \Bigr] \|\theta - \theta_0\|_1 \leq C n^{1/2} J^{1/2} \Delta_{1n}
\end{aligned}
\]
with probability $1 - \delta_n$, where (1) and (5) follow from Lemma 6, (2) follows from the definition of $\nu_{jg}$, (3) follows from the monotonicity of the $L^p$ norm, (4) follows from the union bound, and (6) follows from condition 1.

The first term, (1.1), is the
empirical process in terms of the $G_{jk,l}$ instead of $f_{jk}$. Note that
\[
\begin{aligned}
E_n \mathrm{Var}\bigl(G_{jk,l}(\tilde{X}; \hat\theta) - G_{jk,l}(\tilde{X}; \theta_0)\bigr)
&\leq E_n \Bigl[ \Bigl( \sum_{j' \in [J], l \in [L]} h_{jk}(W) B_{j',g(l)} X_{j',l} (\theta_l - \theta_{0l}) \Bigr)^2 \Bigr] \\
&\leq E_n \Bigl[ \Bigl( h_{jk}(W) \max_{j' \in [J], l \in [L]} B_{j',g(l)} X_{j',l} \Bigr)^2 \Bigr] \|\theta - \theta_0\|_1^2 \\
&\leq C J^2 E_n \Bigl[ \Bigl( h_{jk}(W) \max_{j'' \in [J]} X_{j'',l} \max_{j' \in [J], l \in [L]} X_{j',l} \Bigr)^2 \Bigr] \|\theta - \theta_0\|_1^2
\leq C J^2 B_{1n} \Delta_{1n}^2
\end{aligned}
\]
by the Hölder inequality and Lemma 6. Then the conditions for Corollary 2 hold with $B_{1n} = J^2 B_{1n} \Delta_{1n}^2$ and $B_{2n} = B_{2n}$, so that
\[
\max_{j, k, l} |\mathbb{G}_n(G_{jk,l}(\tilde{X}; \hat\theta) - G_{jk,l}(\tilde{X}; \theta_0))| \leq J^2 G C B_{1n} \Delta_{1n} \log^{1/2}(2J^2GKL/\delta_n)
\]
with probability $1 - \delta_n$. By the same argument as in Belloni et al. (2018), $\|\hat{G} - \tilde{G}\|_\infty$ has the same bound as $\|\hat{G} - G\|_\infty$.

Finally, we consider $\|\hat\Omega - \Omega\|_\infty$. As we did for $\|\hat{G} - \tilde{G}\|_\infty$, $\sqrt{n}\|\hat\Omega - \Omega\|_\infty$ is bounded by the sum of the following three terms:

(2.1) $\max_{j, j' \in [J], k, k' \in [K]} |\mathbb{G}_n(f_{jk}(\tilde{X}; \hat\theta) f_{j'k'}(\tilde{X}; \hat\theta) - f_{jk}(\tilde{X}; \theta_0) f_{j'k'}(\tilde{X}; \theta_0))|$

(2.2) $\max_{j, j' \in [J], k, k' \in [K]} |\mathbb{G}_n f_{jk}(\tilde{X}; \theta_0) f_{j'k'}(\tilde{X}; \theta_0)|$

(2.3) $\max_{j, j' \in [J], k, k' \in [K]} n^{1/2} |E[f_{jk}(\tilde{X}; \hat\theta) f_{j'k'}(\tilde{X}; \hat\theta) - f_{jk}(\tilde{X}; \theta_0) f_{j'k'}(\tilde{X}; \theta_0)]|$

The first term, (2.1), is further bounded by the sum of two terms:

(2.1.1) $\max_{j, j' \in [J], k, k' \in [K]} |\mathbb{G}_n(f_{jk}(\tilde{X}; \hat\theta) - f_{jk}(\tilde{X}; \theta_0))(f_{j'k'}(\tilde{X}; \hat\theta) - f_{j'k'}(\tilde{X}; \theta_0))|$

(2.1.2) $2\max_{j, j' \in [J], k, k' \in [K]} |\mathbb{G}_n((f_{jk}(\tilde{X}; \hat\theta) - f_{jk}(\tilde{X}; \theta_0)) f_{j'k'}(\tilde{X}; \theta_0))|$

First, for (2.1.1), we have
\[
\begin{aligned}
&\max_{j, j', k, k'} |\mathbb{G}_n(f_{jk}(\tilde{X}; \hat\theta) - f_{jk}(\tilde{X}; \theta_0))(f_{j'k'}(\tilde{X}; \hat\theta) - f_{j'k'}(\tilde{X}; \theta_0))| \\
&\leq \max_{j, j', k, k'} n^{1/2} E_n[(f_{jk}(\tilde{X}; \hat\theta) - f_{jk}(\tilde{X}; \theta_0))^2]^{1/2}\, E_n[(f_{j'k'}(\tilde{X}; \hat\theta) - f_{j'k'}(\tilde{X}; \theta_0))^2]^{1/2} \\
&\quad + \max_{j, j', k, k'} n^{1/2} E[(f_{jk}(\tilde{X}; \hat\theta) - f_{jk}(\tilde{X}; \theta_0))^2]^{1/2}\, E[(f_{j'k'}(\tilde{X}; \hat\theta) - f_{j'k'}(\tilde{X}; \theta_0))^2]^{1/2} \\
&\leq \max_{j, j', k, k'} n^{1/2} C^2 J^2 E_n\Bigl[\Bigl(h_{jk}(W) \sum_{j'' \in [J], l \in [L]} X_{j'',l}(\hat\theta_l - \theta_{0l})\Bigr)^2\Bigr]^{1/2} E_n\Bigl[\Bigl(h_{j'k'}(W) \sum_{j'' \in [J], l \in [L]} X_{j'',l}(\hat\theta_l - \theta_{0l})\Bigr)^2\Bigr]^{1/2} \\
&\quad + (\text{the analogous population-expectation term})
\leq n^{1/2} C B_{2n} J^2 \Delta_{1n}^2
\end{aligned}
\]
with probability $1 - C\delta_n$ and a universal constant $\bar{C}$, where the second and last inequalities follow from the Hölder inequality and Lemma 4.

For the second term, (2.1.2), note that
\[
\mathrm{Var}\bigl(\mathbb{G}_n([f_{jk}(\tilde{X}; \theta) - f_{jk}(\tilde{X}; \theta_0)] f_{j'k'}(\tilde{X}; \theta_0))\bigr)
\leq \mathrm{Var}\bigl(\mathbb{G}_n(h_{jk}(W) h_{j'k'}(W)\, \xi_{j'}(\tilde{X}; \theta_0)\, X_j'(\theta - \theta_0))\bigr) \leq \bar{C} J^2 B_{1n} \Delta_{1n}^2,
\]
so that the conditions for Corollary 1 hold with $B_{1n} = J^2 B_{1n} \Delta_{1n}^2$ and $B_{2n} = B_{2n}$; therefore,
\[
\max_{j, j', k, k'} |\mathbb{G}_n((f_{jk}(\tilde{X}; \hat\theta) - f_{jk}(\tilde{X}; \theta_0)) f_{j'k'}(\tilde{X}; \theta_0))| \leq \bar{C} J^2 G B_{1n} \Delta_{1n} \log^{1/2}(2J^2GKL/\delta_n).
\]
For the next term, (2.2), $\max_{j, j', k, k'} |\mathbb{G}_n f_{jk}(\tilde{X}; \theta_0) f_{j'k'}(\tilde{X}; \theta_0)|$ is bounded by
\[
C \max_{j, k} E[f_{jk}(\tilde{X}; \theta_0)^2]^{1/2} \sqrt{\log(JK)} + C n^{-1/2} E\bigl[\max_i \|f(\tilde{X}_i; \theta_0)\|_\infty^2\bigr] \{\delta_n^{-1} + \log(JK)\} \leq C' \sqrt{\log(JK)}
\]
from Lemma C.1(4) of Belloni et al.
(2018) under condition 4.

Finally, for (2.3),
\[
\begin{aligned}
&\max_{j, j', k, k'} n^{1/2} |E[f_{jk}(\tilde{X}; \hat\theta) f_{j'k'}(\tilde{X}; \hat\theta) - f_{jk}(\tilde{X}; \theta_0) f_{j'k'}(\tilde{X}; \theta_0)]| \\
&\leq \max_{j, j', k, k'} n^{1/2} \bigl( |E[f_{jk}(\tilde{X}; \hat\theta)(f_{j'k'}(\tilde{X}; \hat\theta) - f_{j'k'}(\tilde{X}; \theta_0))]| + |E[(f_{jk}(\tilde{X}; \hat\theta) - f_{jk}(\tilde{X}; \theta_0)) f_{j'k'}(\tilde{X}; \theta_0)]| \bigr) \\
&\leq \max_{j, j', k, k'} n^{1/2} |E[(f_{jk}(\tilde{X}; \hat\theta) - f_{jk}(\tilde{X}; \theta_0))(f_{j'k'}(\tilde{X}; \hat\theta) - f_{j'k'}(\tilde{X}; \theta_0))]| \\
&\quad + 2 n^{1/2} \max_{j, j', k, k'} |E[(f_{jk}(\tilde{X}; \hat\theta) - f_{jk}(\tilde{X}; \theta_0)) f_{j'k'}(\tilde{X}; \theta_0)]| \\
&\leq n^{1/2} J^2 C \Delta_{2n}^2 + 2 n^{1/2} \max_{j, j', k, k'} E\Bigl[ \bar{C} J\, h_{jk}(W) h_{j'k'}(W)\, \xi_{j'}(\tilde{X}; \theta_0) \max_{j' \in [J], l' \in [L]} X_{j',l'} \Bigr] \sum_{l' \in [L]} |\hat\theta_{l'} - \theta_{0l'}| \\
&\leq n^{1/2} J^2 C \Delta_{2n}^2 + n^{1/2} J^{1/2} C \Delta_{1n}
\end{aligned}
\]
by Lemma 4 and the Hölder inequality.

Combining these results, we attain the asymptotic linearity.

Theorem 3.
Suppose that $\max_{j\in[JK]}\mathbb E[f_j(\tilde X;\theta_0)^2]\le C$ and
\[n^{-1/2}\mathbb E\Big[\max_{i\in[n]}\|f(\tilde X_i;\theta_0)\|_\infty^2\Big]\le \min\{\delta_n,\log^{-1/2}(JKL)\}.\]
Suppose that $\|\hat f(\theta_0)\|_\infty\le\lambda$ with probability at least $1-\alpha$, and that the assumptions for Lemmas 1, 2 and 3 and Theorem 1 hold with $B_n+\bar C+\mu_n^{-1}\le C$. For $\bar a\ge 1$ and $C'\ge 1$, let $\bar\lambda=C'J^{1/2}\max\{J^{1/2}\tilde\lambda^2,\tilde\lambda\}$ with $\tilde\lambda\equiv n^{-1/2}\bar a\,JG\,\Phi^{-1}(1-(2J^2GKLn)^{-1})$. Then, setting $\lambda_{\gamma l}=\lambda_{\mu l}=\bar\lambda$, we have with probability $1-\alpha-C\delta_n$,
\[\sqrt n(\hat{\hat\theta}-\theta_0)=-\mu'\gamma'\hat f(\theta_0)+r,\qquad \|r\|_\infty\le Cu_n,\]
where $\hat{\hat\theta}$ is the updated RGMM estimator, provided that
\[n^{-1/2}\bar a^2 sJ^{1/2}\max\big\{n^{-1/2}J^{1/2}J^2G^2\log^{1/2}(2J^2GKLn),\;JG\log(2J^2GKLn)\big\}\le u_n\]
and $\bar\lambda\ge CJ^{1/2}\max\{J^{1/2}D_n^2,D_n\}$ for some large enough $C>0$, where $D_n\equiv n^{-1/2}s^{1/2}\big(\log^{1/2}(2JKL)+JG\log^{1/2}(2J^2GKL/\delta_n)\big)$.

Proof. First note that under the assumptions the rate terms of Theorem 1 satisfy $l_n\le C'\log^{1/2}(2JKL)$ and $l_n'\le C'JG\log^{1/2}(2J^2KGL)$ with probability $1-\delta_n$.
Therefore,
\[\Delta_{q,n}\le C'n^{-1/2}s^{1/q}(l_n+l_n')\le C''n^{-1/2}s^{1/q}\big(\log^{1/2}(2JKL)+JG\log^{1/2}(2J^2KGL)\big)\equiv D_{q,n}\]
with probability at least $1-\alpha-\delta_n$. To apply Lemma 2, note that for $s\ge 1$ we have $n^{-1/2}\log^{1/2}(2JKL)\le n^{-1/2}s^{1/q}\log(2JKL)\le D_{q,n}$. Note also that $D_n\,n^{-1/2}JG\log^{1/2}(2J^2GKL/\delta_n)\le D_n^2$. Therefore, by Lemma 3, we have
\[\max\{\|\hat G-G\|_\infty,\|\hat G-\tilde G\|_\infty,\|\hat\Omega-\Omega\|_\infty\}\le CJ^{1/2}(J^{1/2}D_n^2+D_n)\]
with probability $1-C\delta_n$. Now, let $l_{\Omega,n}=l_{G,n}=n^{1/2}CJ^{1/2}\max\{J^{1/2}D_n^2,D_n\}$ and $\lambda_{\gamma l}=\lambda_{\mu j}=\bar\lambda$, so that $n^{1/2}\bar\lambda\ge(\bar C+1)n^{1/2}CJ^{1/2}(J^{1/2}D_n^2+D_n)$ and $\lambda\le n^{-1/2}l_n\le\bar\lambda$. Then Lemma 2 applies to give
\[\max_{l\in[p]}\|\hat\gamma_l-\gamma_l\|_1\le Cs\bar\lambda,\]
and the similar definition for $l_n'$ gives
\[\max_{j\in[JK]}\|\hat\mu_j-\mu_j\|_1\le Cs\bar\lambda.\]
By Lemma 3.6 of Belloni et al. (2018), the decomposed error rates defined in the lemma, $\bar r_1$, $\bar r_2$ and $\bar r_3$, are bounded by
\[\bar r_1\le\sqrt n\,\bar\lambda\Delta_n,\qquad \bar r_2\le Cn^{1/2}(\Delta_n+J\Delta_n^2)\Delta_n,\qquad \bar r_3\le Cn^{-1/2}\bar\lambda^2\Delta_n.\]
Thus $\|r\|_\infty\le Cu_n$ for $u_n$ such that $n^{-1/2}\bar a^2 sJ^{1/2}\max\{n^{-1/2}J^{1/2}J^2G^2\log^{1/2}(2J^2GKLn),JG\log(2J^2GKLn)\}\le u_n$, with probability $1-\alpha-C\delta_n$.

In this paper, we propose an $\ell_1$-penalized estimation procedure for the random-coefficients logit model of differentiated-product demand. Unlike existing approaches, our procedure allows for random coefficients on possibly high-dimensional attributes. Therefore, both the mean and the variance of the indirect utilities may be determined by a high-dimensional but sparse set of attributes. Our strategy is based on the contraction inequality of Ledoux and Talagrand (1991), as used in Belloni et al. (2018) for the GMM procedure with a single index.
We show that the $\ell_1$-regularized GMM estimation and its de-biased procedure are valid for the BLP model with a fixed number of indices generated from a high-dimensional but sparse set of attributes. Unfortunately, the contraction-inequality principle does not apply to a fully flexible random-coefficients BLP model, because the number of indices then grows at the same rate as the number of attributes. Also, our current result does not accommodate models in which the number of products grows exponentially. These challenges are left for future work.
References

ARMSTRONG, T. B. (2016): "Large Market Asymptotics for Differentiated Product Demand Estimators with Economic Models of Supply," Econometrica, 84, 1961–1980.

BELLONI, A., V. CHERNOZHUKOV, D. CHETVERIKOV, C. HANSEN, AND K. KATO (2018): "High-Dimensional Econometrics and Regularized GMM," arXiv preprint arXiv:1806.01888.

BERRY, S., J. LEVINSOHN, AND A. PAKES (1995): "Automobile Prices in Market Equilibrium," Econometrica, 841–890.

BERRY, S., O. B. LINTON, AND A. PAKES (2004): "Limit Theorems for Estimating the Parameters of Differentiated Product Demand Systems," The Review of Economic Studies, 71, 613–654.

BERRY, S. T. (1994): "Estimating Discrete-Choice Models of Product Differentiation," RAND Journal of Economics, 242–262.

FOSTER, D. J. AND A. RAKHLIN (2019): "$\ell_\infty$ Vector Contraction for Rademacher Complexity," arXiv preprint arXiv:1911.06468.

FRANK, I. AND J. FRIEDMAN (1993): "A Statistical View of Some Chemometrics Regression Tools," Technometrics, 35, 109–135.

GILLEN, B. J., S. MONTERO, H. R. MOON, AND M. SHUM (2019): "BLP-2LASSO for Aggregate Discrete Choice Models with Rich Covariates," The Econometrics Journal, 22, 262–281.

LEDOUX, M. AND M. TALAGRAND (1991): Probability in Banach Spaces: Isoperimetry and Processes, Springer.

TIBSHIRANI, R. (1996): "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.

VAN DER VAART, A. AND J. WELLNER (1996): Weak Convergence and Empirical Processes: With Applications to Statistics, Springer.
A Supporting Lemmas
We consider the bound for the linear expansion of $\xi_j$ with respect to the index $x_{j'g}'\gamma_g$ when the share of each product, for any $\tilde\beta$, falls in the shrinking range between $c_0/J$ and $c_1/J$. This is the same assumption employed in Berry, Linton, and Pakes (2004). All these arguments apply to the special case in which the number of products $J$ is a fixed constant, provided every product has a non-zero share in every market.

Lemma 4.
Let
\[s_j(\tilde\beta;\tilde x,\theta)\equiv\frac{\exp\big(x_j'\beta+\xi_j(\tilde x;\theta)+\sum_{g\in[G]}x_{jg}'\gamma_g\tilde\beta_g\big)}{1+\sum_{j'\in[J]}\exp\big(x_{j'}'\beta+\xi_{j'}(\tilde x;\theta)+\sum_{g\in[G]}x_{j'g}'\gamma_g\tilde\beta_g\big)}.\]
Assume that $c_0/J<s_j(\tilde\beta;\tilde x,\theta)<c_1/J$ for almost every $\tilde\beta\in\mathbb R^G$ and $x$, and for every $j\in\{0,1,\ldots,J\}$. Then for any pair of index values $\nu^1_{j'g}$ and $\nu^0_{j'g}$ for each $g\in[G]$, $j'\in[J]$, we have
\[\Big|\frac{d\xi_j(\tilde\nu;\tilde x)}{d\nu_{j'g}}(\nu^1_{j'g}-\nu^0_{j'g})\Big|\le CJ\,|\nu^1_{j'g}-\nu^0_{j'g}|\]
with a universal constant $C$ uniformly over $\tilde\nu$ and $\tilde x$.

Proof. First note that
\[\frac{d\xi(\tilde x;\theta)}{d(x_{j'g}'\gamma_g)}=\Big[\frac{\partial s(\tilde x;\theta)}{\partial\xi}\Big]^{-1}\frac{\partial s(\tilde x;\theta)}{\partial(x_{j'g}'\gamma_g)}=\Big[\frac{\partial s(\tilde x;\theta)}{\partial\xi}\Big]^{-1}\int\tilde\beta_g\,s(\tilde\beta;\tilde x,\theta)(1-s_{j'}(\tilde\beta;\tilde x,\theta))\,dF_{\tilde\beta},\]
where $s(\tilde\beta;\tilde x,\theta)$ is the vector of the $s_j(\tilde\beta;\tilde x,\theta)$.
Then,
\[\mathrm{RHS}\le\Big|\sum_{j''\in[J]}d_{jj''}(\theta)\int\tilde\beta_g\,s_{j''}(\tilde\beta;\tilde x,\theta)(1-s_{j'}(\tilde\beta;\tilde x,\theta))\,dF_{\tilde\beta}\,(\nu^1_{j'g}-\nu^0_{j'g})\Big|,\]
where $d_{jj''}(\theta)$ is the $(j,j'')$ element of the matrix $\big[\partial s(\tilde x;\theta)/\partial\xi\big]^{-1}\equiv D(\theta)$.

For the inverse-matrix elements $d_{jj''}(\theta)$, Berry et al. (2004, p.657) show an upper bound on the $D$ matrix in the positive-definite sense, i.e.,
\[x'\Big(\mathrm{diag}(\underline s_1,\ldots,\underline s_J)^{-1}+\frac{\iota\iota'}{\underline s_0}-D(\theta)\Big)x>0\]
for any non-zero vector $x$, where the $\underline s_j$ are the lower bounds of the shares satisfying the rate condition in the assumption. Thus, each element $|d_{jj''}(\theta)|$ is bounded above by $1/\underline s_j+1/\underline s_0\le 2J/c_0$.

Now,
\begin{align*}
\Big|\sum_{j''\in[J]}d_{jj''}(\theta)\int\tilde\beta_g\,s_{j''}(\tilde\beta;\tilde x,\theta)(1-s_{j'}(\tilde\beta;\tilde x,\theta))\,dF_{\tilde\beta}\,(\nu^1_{j'g}-\nu^0_{j'g})\Big|
&\le\Big|\sum_{j''\in[J]}|d_{jj''}(\theta)|\int|\tilde\beta_g|\,s_{j''}(\tilde\beta;\tilde x,\theta)(1-s_{j'}(\tilde\beta;\tilde x,\theta))\,dF_{\tilde\beta}\Big|\,|\nu^1_{j'g}-\nu^0_{j'g}|\\
&\le\Big|\frac{2J}{c_0}\sum_{j''\in[J]}\int|\tilde\beta_g|\,s_{j''}(\tilde\beta;\tilde x,\theta)\,dF_{\tilde\beta}\Big|\,|\nu^1_{j'g}-\nu^0_{j'g}|\\
&\le\Big|\frac{2J}{c_0}\int|\tilde\beta_g|\,(1-s_0(\tilde\beta;\tilde x,\theta))\,dF_{\tilde\beta}\Big|\,|\nu^1_{j'g}-\nu^0_{j'g}|\\
&\le\Big|\frac{2J}{c_0}\int|\tilde\beta_g|\,dF_{\tilde\beta}\Big|\,|\nu^1_{j'g}-\nu^0_{j'g}|\le CJ\,|\nu^1_{j'g}-\nu^0_{j'g}|.
\end{align*}

Remark 1.
One may achieve an $L^\infty$-Lipschitz result for the vector of indices $\{\nu_{jg}\}_{j\in[J]}$. The $L^\infty$-Lipschitz constant can be made invariant to the number of products $J$, from the assumption that $1-\sum_j s_j(\tilde\beta)=s_0(\tilde\beta)\le c_1/J$. From this property, the variance term of Berry et al. (2004) grows at a lower rate in $J$. Therefore, the rate of convergence may be improved with respect to the number of products $J$ relative to the one in this paper. Nevertheless, the Ledoux–Talagrand contraction inequality does not apply in the $L^\infty$-Lipschitz case. While a recent study by Foster and Rakhlin (2019) shows a tail probability bound for the Rademacher average with an $L^\infty$-Lipschitz mapping, it is not trivial to apply it to our case.

Next, we show sufficient conditions under which the empirical process at the true parameter value $\theta_0$ is bounded at the $\log^{1/2}(JK)$ rate.

Lemma 5.
Assume that there is some $\sigma>0$ such that $\max_{j\in[J],l\in[2L],k\in[K]}\mathbb E_n[(X_{jl}h_{jk}(W))^2]\le\sigma^2$ and
\[\frac{\log(JKn)}{\sqrt n}\Big(\mathbb E\big[\|h_{jk}(W)\|_\infty^2\,\xi(\tilde X;\theta_0)^2\big]\Big)^{1/2}\le\sigma.\]
Then $\|n^{-1/2}\mathbb G_n(g(\tilde X,\theta_0))\|_\infty\le n^{-1/2}C\sigma\sqrt{\log(JK)}$ with probability at least $1-\delta/\log(n)$.

Proof. The analogous argument to Example 7 of Belloni et al. (2018), applying their Lemmas A.2 and A.3, shows the result.
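The lemmas in this appendix repeatedly evaluate the random-coefficients logit share $s_j(\tilde\beta;\tilde x,\theta)$ and its integral over $F_{\tilde\beta}$. As a concrete illustration, the following is a minimal Monte Carlo sketch of those shares; the dimensions, the normal taste draws, and all variable names are illustrative assumptions, not objects defined in the paper.

```python
import numpy as np

def shares(x, xi, gamma, beta_bar, draws):
    """Monte Carlo random-coefficients logit shares.

    x: (J, L) product attributes; xi: (J,) unobserved quality;
    gamma: (L, G) loadings mapping attributes into G indices;
    beta_bar: (L,) mean tastes; draws: (R, G) simulated taste shocks.
    Returns the (J,) vector of shares E[exp(u_j) / (1 + sum_j' exp(u_j'))].
    """
    nu = x @ gamma                        # indices nu_{jg} = x_j' gamma_g, shape (J, G)
    mean_u = x @ beta_bar + xi            # mean indirect utility, shape (J,)
    util = mean_u[None, :] + draws @ nu.T # utilities per draw, shape (R, J)
    expu = np.exp(util)
    s = expu / (1.0 + expu.sum(axis=1, keepdims=True))  # outside good has utility 0
    return s.mean(axis=0)                 # integrate over F_beta by simulation

rng = np.random.default_rng(0)
J, L, G, R = 5, 3, 2, 10_000
x = rng.normal(size=(J, L))
s = shares(x, rng.normal(size=J), rng.normal(size=(L, G)),
           rng.normal(size=L), rng.normal(size=(R, G)))
assert s.shape == (J,) and (s > 0).all() and s.sum() < 1.0
```

Every simulated share is strictly positive and the inside shares sum to less than one, so the positivity needed for the bounds on $D(\theta)$ holds by construction in this sketch.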
Lemma 6.
Assume the assumptions for Lemma 4. Let $\nu^0_{j'g}$ and $\nu^1_{j'g}$ be $JG$-vectors of indices as previously defined. Then there exists a sequence $B_{j'g}(x_l)$, which depends on an intermediate value of $\theta^0$ and $\theta^1$, such that
\[\frac{d\xi_j(\tilde x;\theta^1)}{d\theta_l}-\frac{d\xi_j(\tilde x;\theta^0)}{d\theta_l}=\sum_{j'\in[J],g\in[G]}B_{j'g}(x_l)(\nu^1_{j'g}-\nu^0_{j'g}),\]
and $|B_{j'g}(x_l)|\le\bar CJ\max_{j'\in[J]}|x_{j'l}|$ with a universal constant $\bar C$ for any value of $\theta$ and $x_l$.

Proof. Observe that
\[\frac{d\xi_j(\tilde x;\theta)}{d\theta_l}=D_j(\theta)\int\tilde\beta_{g(l)}\,s(\tilde\beta;\tilde x,\theta)\sum_{j'\in[J]}(1-s_{j'}(\tilde\beta;\tilde x,\theta))x_{j'l}\,dF_{\tilde\beta},\]
where $D_j(\theta)$ denotes the $j$-th row of $D(\theta)$. Below, we omit $\tilde x$ from the arguments of $s_j(\tilde x;\theta)$ and $s_j(\tilde\beta;\tilde x,\theta)$ for notational simplicity. Now we have
\[\Big|\frac{d\xi_j(\tilde x;\theta^1)}{d\theta_l}-\frac{d\xi_j(\tilde x;\theta^0)}{d\theta_l}\Big|=\Big|\sum_{j',j''\in[J]}\int\tilde\beta_{g(l)}\big[d_{jj''}(\theta^1)s_{j''}(\theta^1)(1-s_{j'}(\theta^1))-d_{jj''}(\theta^0)s_{j''}(\theta^0)(1-s_{j'}(\theta^0))\big]x_{j'l}\,dF_{\tilde\beta}\Big|.\]
Thus,
\begin{align*}
&\sum_{j',j''\in[J]}\int\tilde\beta_{g(l)}\big[d_{jj''}(\theta^1)s_{j''}(\theta^1)(1-s_{j'}(\theta^1))-d_{jj''}(\theta^0)s_{j''}(\theta^0)(1-s_{j'}(\theta^0))\big]x_{j'l}\,dF_{\tilde\beta}\\
&=\sum_{j',j''\in[J]}\int\tilde\beta_{g(l)}\big[d_{jj''}(\theta^1)\big(s_{j''}(\theta^1)(1-s_{j'}(\theta^1))-s_{j''}(\theta^0)(1-s_{j'}(\theta^0))\big)+\big(d_{jj''}(\theta^1)-d_{jj''}(\theta^0)\big)s_{j''}(\theta^0)(1-s_{j'}(\theta^0))\big]x_{j'l}\,dF_{\tilde\beta}.
\end{align*}
First consider expanding $s_{j''}(\theta^1)(1-s_{j'}(\theta^1))-s_{j''}(\theta^0)(1-s_{j'}(\theta^0))$ with respect to $\nu^1_{jg}-\nu^0_{jg}$. We have
\[\frac{d\,s_{j''}(\theta)(1-s_{j'}(\theta))}{d\nu_{jg}}=\int\tilde\beta_g\,s_{j''}(\tilde\beta;\theta)s_{j'}(\tilde\beta;\theta)(1-s_j(\tilde\beta;\theta))\,dF_{\tilde\beta}.\]
Therefore, by the mean value theorem, there exists an intermediate value vector $\tilde\theta$ such that
\[s_{j''}(\theta^1)(1-s_{j'}(\theta^1))-s_{j''}(\theta^0)(1-s_{j'}(\theta^0))=\sum_{j=1}^J\sum_{g=0}^G\int\tilde\beta_g\,s_{j''}(\tilde\beta;\tilde\theta)s_{j'}(\tilde\beta;\tilde\theta)(1-s_j(\tilde\beta;\tilde\theta))\,dF_{\tilde\beta}\,(\nu^1_{jg}-\nu^0_{jg}).\]
Next consider expanding $d_{j'j''}(\theta^1)-d_{j'j''}(\theta^0)$ with respect to $\nu^1_{jg}-\nu^0_{jg}$.
Observe that
\begin{align*}
\frac{dD^{-1}(\theta)}{d\nu_{jg}}&=-D^{-1}(\theta)\Big[\frac{d}{d\nu_{jg}}\frac{ds_{j_1}(\theta)}{d\xi_{j_2}}\Big]_{j_1,j_2}D^{-1}(\theta)=-D^{-1}(\theta)\Big[\frac{d}{d\nu_{jg}}s_{j_1}(\theta)(1-s_{j_2}(\theta))\Big]_{j_1,j_2}D^{-1}(\theta)\\
&=-D^{-1}(\theta)\Big[\int\tilde\beta_g\,s_{j_1}(\tilde\beta;\theta)s_{j_2}(\tilde\beta;\theta)(1-s_j(\tilde\beta;\theta))\,dF_{\tilde\beta}\Big]_{j_1,j_2}D^{-1}(\theta)\\
&=-\Big[\sum_{j_1\in[J]}\sum_{j_2\in[J]}d_{j'j_1}(\theta)d_{j_2j''}(\theta)\int\tilde\beta_g\,s_{j_1}(\tilde\beta;\theta)s_{j_2}(\tilde\beta;\theta)(1-s_j(\tilde\beta;\theta))\,dF_{\tilde\beta}\Big]_{j',j''}.
\end{align*}
Therefore, by the mean value theorem,
\[d_{j'j''}(\theta^1)-d_{j'j''}(\theta^0)=-\sum_{j,j_1,j_2\in[J]}\sum_{g=0}^G d_{j'j_1}(\tilde\theta)d_{j_2j''}(\tilde\theta)\int\tilde\beta_g\,s_{j_1}(\tilde\beta;\tilde\theta)s_{j_2}(\tilde\beta;\tilde\theta)(1-s_j(\tilde\beta;\tilde\theta))\,dF_{\tilde\beta}\,(\nu^1_{jg}-\nu^0_{jg}).\]
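The expansion above rests on the matrix-calculus identity $dD^{-1}/d\nu=-D^{-1}(dD/d\nu)D^{-1}$. A quick finite-difference check of this identity; the smooth matrix path below is an arbitrary illustrative choice, not the model's $D(\theta)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))

def D(t):
    # An arbitrary smooth, invertible matrix path; its derivative is dD/dt = 0.1 * B.
    return np.eye(n) + 0.1 * A + 0.1 * t * B

eps = 1e-6
# Central finite difference of t -> D(t)^{-1} at t = 0.
finite_diff = (np.linalg.inv(D(eps)) - np.linalg.inv(D(-eps))) / (2 * eps)
# The closed form -D^{-1} (dD/dt) D^{-1}.
Dinv = np.linalg.inv(D(0.0))
analytic = -Dinv @ (0.1 * B) @ Dinv
assert np.allclose(finite_diff, analytic, atol=1e-6)
```

The same identity, applied entrywise, is what produces the double sum over $d_{j'j_1}(\theta)d_{j_2j''}(\theta)$ in the display above.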
Combining the two results, we have
\[\frac{d\xi_j(\tilde x;\theta^1)}{d\theta_l}-\frac{d\xi_j(\tilde x;\theta^0)}{d\theta_l}=\sum_{\tilde j=1}^J\sum_{g=0}^G B_{\tilde jg}(x_l)(\nu^1_{\tilde jg}-\nu^0_{\tilde jg}),\]
where
\begin{align*}
B_{\tilde jg}(x_l)&\equiv\sum_{j',j''\in[J]}\int\tilde\beta_{g(l)}\,d_{jj''}(\tilde\theta)\tilde\beta_g\,s_{j''}(\tilde\beta;\tilde\theta)s_{j'}(\tilde\beta;\tilde\theta)(1-s_{\tilde j}(\tilde\beta;\tilde\theta))\,dF_{\tilde\beta}\,x_{j'l}\\
&\quad+\sum_{j',j''\in[J]}\int\tilde\beta_{g(l)}\sum_{j_1,j_2\in[J]}d_{j'j_1}(\tilde\theta)d_{j_2j''}(\tilde\theta)\tilde\beta_g\,s_{j_1}(\tilde\beta;\tilde\theta)s_{j_2}(\tilde\beta;\tilde\theta)(1-s_{\tilde j}(\tilde\beta;\tilde\theta))\,s_{j''}(\theta^0)(1-s_{j'}(\theta^0))\,dF_{\tilde\beta}\,x_{j'l}.
\end{align*}
For the second claim, observe that the absolute value of the first term of $B_{\tilde jg}(x_l)$ is bounded above by
\[JC_1\max_{j'\in[J]}|x_{j'l}|\int|\tilde\beta_{g(l)}\tilde\beta_g|\,\Big|\sum_{j',j''\in[J]}s_{j''}(\tilde\beta;\tilde\theta)s_{j'}(\tilde\beta;\tilde\theta)(1-s_{\tilde j}(\tilde\beta;\tilde\theta))\Big|\,dF_{\tilde\beta}\le C_1J\max_{j'\in[J]}|x_{j'l}|,\]
because the crude bound $0<s_j(\tilde\beta;\tilde\theta)<1$ gives
\[\Big|\sum_{j',j''\in[J]}s_{j''}(\tilde\beta;\tilde\theta)s_{j'}(\tilde\beta;\tilde\theta)(1-s_{\tilde j}(\tilde\beta;\tilde\theta))\Big|\le\Big(\sum_{j''\in[J]}s_{j''}(\tilde\beta;\tilde\theta)\Big)\Big(\sum_{j'\in[J]}s_{j'}(\tilde\beta;\tilde\theta)\Big)\le 1.\]
Similarly, the absolute value of the second term is bounded above by
\begin{align*}
&J^2C_2\max_{j'\in[J]}|x_{j'l}|\int|\tilde\beta_{g(l)}\tilde\beta_g|\,\Big|\sum_{j',j'',j_1,j_2\in[J]}s_{j_1}(\tilde\beta;\tilde\theta)s_{j_2}(\tilde\beta;\tilde\theta)(1-s_{\tilde j}(\tilde\beta;\tilde\theta))s_{j''}(\tilde\beta;\tilde\theta)(1-s_{j'}(\tilde\beta;\tilde\theta))\Big|\,dF_{\tilde\beta}\\
&\le J^2C_2\max_{j'\in[J]}|x_{j'l}|\int|\tilde\beta_{g(l)}\tilde\beta_g|\,\Big|s_0(\tilde\beta;\tilde\theta)\sum_{j'',j_1,j_2\in[J]}s_{j_1}(\tilde\beta;\tilde\theta)s_{j_2}(\tilde\beta;\tilde\theta)(1-s_{\tilde j}(\tilde\beta;\tilde\theta))s_{j''}(\tilde\beta;\tilde\theta)\Big|\,dF_{\tilde\beta}\le C_2J\max_{j'\in[J]}|x_{j'l}|,
\end{align*}
since $|d_{j'j_1}(\theta)d_{j_2j''}(\theta)|\le|d_{j'j_1}(\theta)||d_{j_2j''}(\theta)|$. Therefore, the claimed statement follows for a constant $\bar C\ge\max\{C_1,C_2\}$.

Corollary 1.
In addition to the assumptions for Lemma 4, suppose that

1. $\sup_{\Delta\theta\in\Theta,\,j,j'\in[J],\,k,k'\in[K]}\mathbb E_n\,\mathrm{Var}\big((f_{jk}(\tilde X;\theta_0+\Delta\theta)-f_{jk}(\tilde X;\theta_0))f_{j'k'}(\tilde X;\theta_0)\big)\le B_{1n}^2$, and

2. $\max_{j,j'\in[J],\,l\in[2L],\,k,k'\in[K]}\mathbb E_n\big[(X_{jl}h_{jk}(W)h_{j'k'}(W)\xi_j(\tilde X;\theta_0))^2\big]\le B_{2n}^2$

with probability at least $1-\delta_n/2$. Then,
\[\sup_{\theta\in\mathcal R(\theta_0),\,j,j'\in[J],\,k,k'\in[K]}\big|\mathbb G_n\big[(f_{jk}(\tilde X;\theta)-f_{jk}(\tilde X;\theta_0))f_{j'k'}(\tilde X;\theta_0)\big]\big|\le n^{-1/2}C\big(B_{1n}+2JG\,B_{2n}\|\theta-\theta_0\|_1\big)\log^{1/2}(8J^2GKL/\delta_n)\]
with probability at least $1-\delta_n$, with a universal constant $C$.

Proof. All the arguments in the proof of Theorem 1 apply after replacing the $h_{jk}(W_i)$ terms with $h_{jk}(W_i)h_{j'k'}(W_i)\xi_{j'}(\tilde X;\theta_0)$.

Corollary 2.
In addition to the assumptions for Lemma 6, suppose that

1. $\sup_{\Delta\theta\in\Theta,\,j\in[J],\,k\in[K],\,l\in[2L]}\mathbb E_n\,\mathrm{Var}\big(G_{jk,l}(\tilde X;\theta_0+\Delta\theta)-G_{jk,l}(\tilde X;\theta_0)\big)\le B_{1n}^2$, and

2. $\max_{j\in[J],\,l\in[2L],\,k\in[K]}\mathbb E_n\big[(h_{jk}(W)X_{jl}\max_{j'\in[J]}|X_{j'l}|)^2\big]\le B_{2n}^2$

with probability at least $1-\delta_n/2$. Then,
\[\sup_{j\in[J],\,k\in[K],\,l\in[2L]}\big|\mathbb G_n\big(G_{jk,l}(\tilde X;\hat\theta)-G_{jk,l}(\tilde X;\theta_0)\big)\big|\le n^{-1/2}C\big(B_{1n}+2JG\,B_{2n}\|\hat\theta-\theta_0\|_1\log^{1/2}(8J^2GKL/\delta_n)\big)\]
with probability at least $1-\delta_n$, with a universal constant $C$.

Proof. By Lemma 6, the gradient functions $G_{jk,l}(\tilde X;\theta)$ can be linearly expanded with respect to the indices $\nu^1_{jg}-\nu^0_{jg}$, and their coefficients depend on $l$ only through the corresponding sub-vector of covariates $X_l$. Therefore, all the arguments in the proof of Theorem 1 apply after replacing the $f_{jk}(\tilde X;\theta)$ terms with $G_{jk,l}(\tilde X;\theta)$ and the $h_{jk}(W_i)$ terms with $h_{jk}(W_i)\max_{j'\in[J]}|X_{j'l}|$.

Corollary 3. (Based on Ledoux and Talagrand (1991, Theorem 4.12)) Let $F:\mathbb R^+\to\mathbb R^+$ be convex and increasing. Let $\mathcal N$ be a subset of $\mathbb R^{nJ}$, and let $\mathcal N_i$, $\mathcal N_j$, and $\mathcal N_{ij}$ for each $i\in[n]$ and $j\in[J]$ be the $i$-, $j$-, and $(ij)$-th coordinates of $\mathcal N$. Let $\sigma=\{\sigma_i\}_{i\in[n]}$ be independent Rademacher random variables taking values in $\{-1,1\}$ with equal probability. Let $\varphi_i:\mathcal N_i\to\mathbb R$ be functions such that $|\varphi_i(\nu_i)|\le 1$ and $|\varphi_i(\nu_i)\nu_{ij}-\varphi_i(\nu_i')\nu_{ij}'|\le|\nu_{ij}-\nu_{ij}'|$ uniformly over $\nu_i\in\mathcal N_i$ and $\nu_{ij},\nu_{ij}'\in\mathcal N_{ij}$, for every $i\in[n]$ and $j\in[J]$. Then
\[\mathbb E\Big[F\Big(\frac12\sup_{\nu\in\mathcal N}\Big|\sum_{i=1}^n\sigma_i\varphi_i(\nu_i)\nu_{ij}\Big|\Big)\Big]\le\mathbb E\Big[F\Big(\sup_{\nu_j\in\mathcal N_j}\Big|\sum_{i=1}^n\sigma_i\nu_{ij}\Big|\Big)\Big].\]

Proof.
The result follows from the proof of Ledoux and Talagrand (1991), Theorem 4.12. Below, we state a modified sketch of the original proof. First, we want to show that
\[\mathbb E\Big[G\Big(\sup_{\nu\in\mathcal N}\sum_{i=1}^n\sigma_i\varphi_i(\nu_i)\nu_{ij}\Big)\Big]\le\mathbb E\Big[G\Big(\sup_{\nu_j\in\mathcal N_j}\sum_{i=1}^n\sigma_i\nu_{ij}\Big)\Big]\]
for convex and increasing $G:\mathbb R\to\mathbb R$. Once the above inequality holds, we achieve the stated inequality by the symmetry of the distribution of the random variables multiplied with the Rademacher variables.

We show the above inequality by conditioning and iteration. Let $\sigma_{i>1}\equiv\{\sigma_2,\ldots,\sigma_n\}$. Now, order the $2^{n-1}$ support values of $\sigma_{i>1}$, and let $\sigma^r_{i>1}$ be the $r$-th value in the ordered support. As the Rademacher variables are independent,
\begin{align*}
\mathbb E\Big[G\Big(\sup_{\nu\in\mathcal N}\sum_{i=1}^n\sigma_i\varphi_i(\nu_i)\nu_{ij}\Big)\Big]&=\sum_{r=1}^{2^{n-1}}\mathbb E\Big[G\Big(\sup_{\nu\in\mathcal N}\sigma_1\varphi_1(\nu_1)\nu_{1j}+\sum_{i>1}\sigma^r_i\varphi_i(\nu_i)\nu_{ij}\Big)\,\Big|\,\sigma^r_{i>1}\Big]\Big(\frac12\Big)^{n-1}\\
&=\sum_{r=1}^{2^{n-1}}\mathbb E\Big[G\Big(\sup_{\nu\in\mathcal N}\sigma_1\varphi_1(\nu_1)\nu_{1j}+\sum_{i>1}\sigma^r_i\varphi_i(\nu_i)\nu_{ij}\Big)\Big]\Big(\frac12\Big)^{n-1}.
\end{align*}
If
\[\mathbb E\Big[G\Big(\sup_{\nu\in\mathcal N,\,t\in\mathbb R}\sigma_1\varphi_1(\nu_1)\nu_{1j}+t\Big)\Big]\le\mathbb E\Big[G\Big(\sup_{\nu_j\in\mathcal N_j,\,t\in\mathbb R}\sigma_1\nu_{1j}+t\Big)\Big],\]
then we have
\[\sum_{r=1}^{2^{n-1}}\mathbb E\Big[G\Big(\sup_{\nu\in\mathcal N}\sigma_1\varphi_1(\nu_1)\nu_{1j}+\sum_{i>1}\sigma^r_i\varphi_i(\nu_i)\nu_{ij}\Big)\Big]\Big(\frac12\Big)^{n-1}\le\sum_{r=1}^{2^{n-1}}\mathbb E\Big[G\Big(\sup_{\nu_j\in\mathcal N_j,\,\nu_{-1}\in\mathcal N_{-1}}\sigma_1\nu_{1j}+\sum_{i>1}\sigma^r_i\varphi_i(\nu_i)\nu_{ij}\Big)\Big]\Big(\frac12\Big)^{n-1};\]
therefore, we achieve the target inequality by iterating over the remaining coordinates $i>1$.

Now we show that for all $t_1,s_1\in\mathcal N_1$ and $t,s\in\mathcal N$,
\[\frac12 G(s_j-\varphi_1(s_1)s_{1j})+\frac12 G(t_j+\varphi_1(t_1)t_{1j})\le\frac12 G(s_j-s_{1j})+\frac12 G(t_j+t_{1j}).\]
The remaining argument follows essentially the same argument as the proof of Ledoux and Talagrand (1991), except for the fact that $\varphi_1(s_1)$ takes a vector argument. Nevertheless, a similar argument applies because it is uniformly bounded by a constant. First, we may assume that
\[t_j+\varphi_1(t_1)t_{1j}\ge s_j+\varphi_1(s_1)s_{1j}\quad\text{and}\quad s_j-\varphi_1(s_1)s_{1j}\ge t_j-\varphi_1(t_1)t_{1j};\]
otherwise the two separate suprema under $\sigma_1=1$ and $\sigma_1=-1$ are solved as a single supremum under common variables, either $(t_1,t)$ or $(s_1,s)$ only. We distinguish between the following cases. When $t_{1j}\ge s_{1j}\ge 0$, we have
\[t_j+\varphi_1(t_1)t_{1j}-s_j+s_{1j}\ge s_j+\varphi_1(s_1)s_{1j}-s_j+s_{1j}\ge s_{1j}-|\varphi_1(s_1)|s_{1j}=(1-|\varphi_1(s_1)|)s_{1j}\ge 0,\]
and $s_{1j}-\varphi_1(s_1)s_{1j}\le t_{1j}-\varphi_1(t_1)t_{1j}$ from $|\varphi_1(t_1)t_{1j}-\varphi_1(s_1)s_{1j}|\le|t_{1j}-s_{1j}|$ and $t_{1j}\ge s_{1j}$. Therefore, we have
\begin{align*}
G(s_j-\varphi_1(s_1)s_{1j})-G(s_j-s_{1j})&\le G(s_j-s_{1j}+(1-\varphi_1(s_1))s_{1j})-G(s_j-s_{1j})\\
&\le G(t_j+\varphi_1(t_1)t_{1j}+(1-\varphi_1(s_1))s_{1j})-G(t_j+\varphi_1(t_1)t_{1j})\\
&\le G(t_j+s_{1j}-\varphi_1(s_1)s_{1j}+\varphi_1(t_1)t_{1j})-G(t_j+\varphi_1(t_1)t_{1j})\\
&\le G(t_j+t_{1j}-\varphi_1(t_1)t_{1j}+\varphi_1(t_1)t_{1j})-G(t_j+\varphi_1(t_1)t_{1j})\\
&\le G(t_j+t_{1j})-G(t_j+\varphi_1(t_1)t_{1j}),
\end{align*}
as $G(\cdot+x)-G(\cdot)$ is increasing for any $x\ge 0$. Thus, the desired inequality is achieved. The same argument applies with $t$ replaced by $s$ and $\varphi_1$ by $-\varphi_1$. The parallel argument holds when $t_{1j}\le s_{1j}\le 0$. When $t_{1j}\ge 0$ and $s_{1j}\le 0$,
\[G(t_j+\varphi_1(t_1)t_{1j})-G(t_j+t_{1j})\le G(t_j+|\varphi_1(t_1)|t_{1j})-G(t_j+t_{1j})\le 0\]
and
\[G(s_j-\varphi_1(s_1)s_{1j})-G(s_j-s_{1j})\le G(s_j-|\varphi_1(s_1)|s_{1j})-G(s_j-s_{1j})\le 0.\]
The parallel argument applies when $t_{1j}\le 0$ and $s_{1j}\ge 0$.
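Corollary 3 adapts the Ledoux–Talagrand contraction principle. Its flavor in the standard scalar form, namely that composing the coordinates with a 1-Lipschitz map $\varphi$ with $\varphi(0)=0$ cannot inflate the Rademacher average by more than a factor of 2, can be checked by simulation; the finite class of vectors, the choice $\varphi=\tanh$, and the sample sizes below are illustrative assumptions:

```python
import numpy as np

# Monte Carlo check of the scalar contraction principle:
# for 1-Lipschitz phi with phi(0) = 0,
#   E sup_t |sum_i sigma_i phi(t_i)| <= 2 E sup_t |sum_i sigma_i t_i|.
rng = np.random.default_rng(2)
n, n_points, reps = 8, 20, 20_000
T = rng.uniform(-1.0, 1.0, size=(n_points, n))   # finite class of vectors t in R^n
phi = np.tanh                                     # 1-Lipschitz and phi(0) = 0
sigma = rng.choice([-1.0, 1.0], size=(reps, n))   # Rademacher draws

# Empirical Rademacher averages of the contracted and original classes.
lhs = np.abs(sigma @ phi(T).T).max(axis=1).mean()
rhs = np.abs(sigma @ T.T).max(axis=1).mean()
assert lhs <= 2 * rhs
```

The appendix's Corollary 3 is a vector-valued variant of this principle, with the extra multiplier $\nu_{ij}$ inside the supremum; the simulation above only illustrates the scalar mechanism that the proof iterates coordinate by coordinate.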