Linear Models are Most Favorable among Generalized Linear Models
Kuan-Yun Lee and Thomas A. Courtade
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
{timkylee, courtade}@berkeley.edu
Abstract—We establish a nonasymptotic lower bound on the $L_2$ minimax risk for a class of generalized linear models. It is further shown that the minimax risk for the canonical linear model matches this lower bound up to a universal constant. Therefore, the canonical linear model may be regarded as most favorable among the considered class of generalized linear models (in terms of minimax risk). The proof makes use of an information-theoretic Bayesian Cramér-Rao bound for log-concave priors, established by Aras et al. (2019).

I. INTRODUCTION AND MAIN RESULTS
As their name suggests, generalized linear models (GLMs) are a flexible class of parametric statistical models that generalize the class of linear models relating a random observation $X \in \mathbb{R}^n$ to a parameter $\theta \in \mathbb{R}^d$ via the linear relation
$$X = M\theta + Z, \tag{1}$$
where $M \in \mathbb{R}^{n\times d}$ is a known (fixed) design matrix, and $Z \in \mathbb{R}^n$ is a random noise vector. For a univariate GLM in canonical form with natural parameter $\eta \in \mathbb{R}$, the density of observation $X \in \mathbb{R}$ given $\eta$ is expressed as the exponential family
$$f(x;\eta) = h(x)\exp\left(\frac{\eta x - \Phi(\eta)}{s(\sigma)}\right),$$
for known functions $h : \mathcal{X} \subseteq \mathbb{R} \to [0,\infty)$ (the base measure), $\Phi : \mathbb{R} \to \mathbb{R}$ (the cumulant function) and a scale parameter $s(\sigma) > 0$. For this general class of models, the question of central importance is how well one can estimate $\eta$ from an observation $X \sim f(\cdot\,;\eta)$, where $f(\cdot\,;\eta)$ is understood to be a density on a probability space $(\mathcal{X} \subseteq \mathbb{R}, \mathcal{F})$ with respect to a dominating $\sigma$-finite measure $\lambda$. This class of models captures a wide variety of parametric models such as binomial, Gaussian, Poisson, etc. As a specific example, we can take $\mathcal{X} = \{0, 1, 2, \dots\}$ equipped with the counting measure $\lambda$. For $h(x) = 1/x!$, $\Phi(t) = e^t$ and $s(\sigma) = 1$, the observation $X \sim f(\cdot\,;\eta)$ will be Poisson$(e^\eta)$.
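As a quick sanity check of this Poisson example (our own illustration; the value of $\eta$ is arbitrary), the canonical-form density with $h(x) = 1/x!$, $\Phi(t) = e^t$ and $s(\sigma) = 1$ can be compared directly against the Poisson$(e^\eta)$ mass function:

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import factorial

# Poisson instance of the canonical form: h(x) = 1/x!, Phi(t) = e^t, s(sigma) = 1.
eta = 0.7                                   # natural parameter (illustrative value)
x = np.arange(0, 10)

density = (1.0 / factorial(x)) * np.exp(eta * x - np.exp(eta))   # h(x) exp(eta*x - Phi(eta))
pmf = poisson.pmf(x, mu=np.exp(eta))                             # Poisson(e^eta) mass function

print(np.allclose(density, pmf))            # True: f(.; eta) is the Poisson(e^eta) law
```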
In this paper, we restrict our attention to multivariate GLMs of the form
$$f(x;\theta) = \prod_{i=1}^n \left\{ h(x_i)\exp\left(\frac{x_i \langle m_i, \theta\rangle - \Phi(\langle m_i, \theta\rangle)}{s(\sigma)}\right)\right\}, \tag{2}$$
for a real parameter $\theta \in \mathbb{R}^d$ and a fixed design matrix $M \in \mathbb{R}^{n\times d}$, with rows given by the vectors $\{m_i\}_{i=1}^n \subset \mathbb{R}^d$. In words, the above model assumes each $X_i$ is drawn from the same exponential family, with respective natural parameters $\langle m_i, \theta\rangle$, $i = 1, 2, \dots, n$. This captures the linear model (1) in the usual case where the noise vector $Z$ is assumed to be standard normal on $\mathbb{R}^n$, but is also flexible enough to capture many other models of interest.

In terms of parameter estimation, a typical figure of merit is the constrained $L_2$ minimax risk, which corresponds to the worst-case $L_2$ estimation error, where $\theta$ is allowed to range over a constrained set $\Theta$. For our purposes, we take $\Theta$ equal to the Euclidean ball in $\mathbb{R}^d$, denoted $B_d(1) := \{v : v \in \mathbb{R}^d, \|v\|_2 \le 1\}$, which is a common choice in applications. More precisely, we make the following definition.

Definition 1. For the generalized linear model (2), we define the associated minimax risk via
$$R^*(h, \Phi, M, s(\sigma)) := \inf_{\hat\theta}\ \sup_{\theta \in B_d(1)} \mathbb{E}\,\|\theta - \hat\theta\|_2^2,$$
where the expectation is over $X \sim f(\cdot\,;\theta)$, and the infimum is over all $\mathbb{R}^d$-valued estimators $\hat\theta$ (i.e., measurable functions of the observation $X$).

Before we state our main results, we make the following assumption throughout:
Assumption 1.
We assume the cumulant function
$\Phi : \mathbb{R} \to \mathbb{R}$ in (2) is twice-differentiable, with second derivative uniformly bounded as $\Phi'' \le L$, for some $L > 0$.

Remark 2.
This assumption is standard in the literature on minimax estimation for GLMs, and is equivalent to the map $\theta \mapsto \mathbb{E}_{X\sim f(\cdot;\theta)}[X]$ being $L$-Lipschitz. See, for example, [1]–[4].
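The Gaussian case gives a simple numerical illustration of this equivalence. The sketch below (ours, with arbitrary values of $L$, $\sigma^2$ and $\eta$) takes $\Phi(t) = Lt^2/2$ and $s(\sigma) = \sigma^2$, in which case an observation with natural parameter $\eta$ has mean $\Phi'(\eta) = L\eta$, so the mean map is exactly $L$-Lipschitz:

```python
import numpy as np

rng = np.random.default_rng(1)

# Gaussian instance of the univariate family: Phi(t) = L*t^2/2, s(sigma) = sigma^2,
# so an observation with natural parameter eta has mean Phi'(eta) = L*eta and
# variance s(sigma)*Phi''(eta) = L*sigma^2 (the standard identities recalled
# later in Lemma 10).
L, sigma2, eta = 2.0, 0.5, 0.3
samples = rng.normal(loc=L * eta, scale=np.sqrt(L * sigma2), size=200_000)

print("empirical mean:", samples.mean(), " Phi'(eta) =", L * eta)
print("empirical var :", samples.var(),  " s(sigma)*Phi''(eta) =", L * sigma2)

# The mean map eta -> Phi'(eta) = L*eta is exactly L-Lipschitz, illustrating the
# equivalence noted in Remark 2 when Phi'' <= L.
```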
Our first main result is a general lower bound on the minimax risk for the class of GLMs introduced above.

Theorem 3.
The $L_2$ minimax risk for the class of models (2) is lower bounded according to
$$R^*(h, \Phi, M, s(\sigma)) \ \gtrsim\ \min\left(\frac{s(\sigma)}{L}\,\mathrm{Tr}\big((M^\top M)^{-1}\big),\ 1\right), \tag{3}$$
where $\gtrsim$ denotes inequality up to a universal constant.

Remark 4.
In case $M^\top M$ is not invertible, we adopt the convention that $\mathrm{Tr}\big((M^\top M)^{-1}\big) = +\infty$. This situation occurs when $M$ is not full rank, in which case $\theta$ is not identifiable in the null space of $M$ and constant error is unavoidable.

Remark 5. In fact, with minor modification, Theorem 3 holds for the more general class of GLMs with observations drawn from densities of the form
$$f(x;\theta) = \prod_{i=1}^n \left\{ h_i(x_i)\exp\left(\frac{x_i \langle m_i, \theta\rangle - \Phi_i(\langle m_i, \theta\rangle)}{s_i(\sigma)}\right)\right\}.$$
See Section II-C for further remarks.
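The right-hand side of (3) is straightforward to evaluate for a concrete design. The sketch below (ours, with made-up matrices and constants) implements it together with the convention of Remark 4 for rank-deficient $M$:

```python
import numpy as np

def glm_minimax_lower_bound(M, s_sigma, L):
    """Right-hand side of (3), up to the universal constant:
    min( (s(sigma)/L) * Tr((M^T M)^{-1}), 1 ), with the convention of Remark 4
    that the trace is +infinity when M^T M is singular."""
    gram = M.T @ M
    if np.linalg.matrix_rank(gram) < gram.shape[0]:
        trace_inv = np.inf
    else:
        trace_inv = np.trace(np.linalg.inv(gram))
    return min(s_sigma / L * trace_inv, 1.0)

rng = np.random.default_rng(2)
M_full = rng.normal(size=(40, 4))
M_deficient = np.hstack([M_full[:, :3], M_full[:, :1]])   # duplicated column: rank 3 < 4

print(glm_minimax_lower_bound(M_full, s_sigma=1.0, L=2.0))
print(glm_minimax_lower_bound(M_deficient, s_sigma=1.0, L=2.0))   # constant-order bound: 1.0
```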
Remark 6.
Since minimax risk is generally characterized modulo universal constants, the statement (3) in terms of $\gtrsim$ is sufficient for our purposes. However, a careful analysis of our arguments reveals that $\gtrsim$ can be replaced with $\ge$ at the expense of including a modest constant (proportional to $1/(\pi e)$) in the RHS of (3).

Most interestingly, the minimax bound (3) holds uniformly over the class of GLMs given by (2), and is of the correct order for the canonical linear model (1). Indeed, under the linear model $X = LM\theta + Z$, where $Z$ is Gaussian with covariance $\sigma^2 L \cdot I$ and the design matrix $M$ is full rank, the maximum likelihood estimator (MLE) $\hat\theta_{\mathrm{MLE}}$ is given by
$$\hat\theta_{\mathrm{MLE}} = L^{-1}(M^\top M)^{-1}M^\top X.$$
One can explicitly calculate the $L_2$ error as
$$\mathbb{E}\,\|\theta - \hat\theta_{\mathrm{MLE}}\|_2^2 = \mathbb{E}\,\|\theta - L^{-1}(M^\top M)^{-1}M^\top X\|_2^2 = \frac{1}{L^2}\,\mathbb{E}\,\|(M^\top M)^{-1}M^\top Z\|_2^2 = \frac{\sigma^2}{L}\,\mathrm{Tr}\big((M^\top M)^{-1}\big). \tag{4}$$
The linear model in this case corresponds to $h(x) = e^{-x^2/(2L\sigma^2)}$, $s(\sigma) = \sigma^2$, and $\Phi(t) = Lt^2/2$ in (2). Comparing (4) to Theorem 3, we find that our minimax lower bound is achieved (up to a universal constant) for linear models of the above form. To summarize, we have the following:

Corollary 7.
Fix a design matrix $M$, scale parameter $s(\sigma)$ and $L > 0$. Among those generalized linear models in (2) with $\Phi'' \le L$, linear models are most favorable in terms of minimax risk. More precisely, among this class of models,
$$R^*(h, \Phi, M, s(\sigma)) \ \gtrsim\ R^*\!\left(e^{-(\cdot)^2/(2Ls(\sigma))},\ (\cdot)^2 L/2,\ M,\ s(\sigma)\right).$$
Roughly speaking, the above asserts that linear models are most favorable among a broad class of GLMs, giving this paper its name.
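The risk formula (4) for the linear model of Corollary 7 can be checked by direct simulation. The following sketch (ours, with arbitrary dimensions $n$, $d$ and constants $L$, $\sigma^2$) compares the Monte Carlo risk of the MLE to $(\sigma^2/L)\,\mathrm{Tr}((M^\top M)^{-1})$:

```python
import numpy as np

rng = np.random.default_rng(3)

n, d, L, sigma2 = 60, 4, 2.0, 0.25
M = rng.normal(size=(n, d))
theta = rng.normal(size=d)
theta /= 2 * np.linalg.norm(theta)            # keep theta inside B_d(1)

pinv = np.linalg.inv(M.T @ M) @ M.T           # (M^T M)^{-1} M^T

# Linear model of (4): X = L*M*theta + Z with Z ~ N(0, sigma^2 * L * I).
risks = []
for _ in range(20_000):
    Z = rng.normal(scale=np.sqrt(sigma2 * L), size=n)
    X = L * (M @ theta) + Z
    theta_mle = pinv @ X / L                  # MLE: L^{-1} (M^T M)^{-1} M^T X
    risks.append(np.sum((theta - theta_mle) ** 2))

print("Monte Carlo risk of the MLE   :", np.mean(risks))
print("(sigma^2/L)*Tr((M^T M)^{-1})  :", sigma2 / L * np.trace(np.linalg.inv(M.T @ M)))
```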
A. Related Work
Perhaps most closely related to our work is that of Abramovich and Grinshtein [1], albeit for a slightly different setup. In particular, Abramovich and Grinshtein provide minimax lower bounds for the Kullback-Leibler divergence between the vector $M\theta$ and any estimator $\widehat{M\theta}$ in a $k$-sparse setting $\|\theta\|_0 \le k$, with the parameter $\theta$ constrained to have at most $k$ non-zero entries. When the cumulant function $\Phi$ is strongly convex with $0 < R \le \Phi'' \le L$ for some fixed constants $R, L$, we can adapt the arguments of [1] to obtain the following $L_2$ minimax lower bound
$$\inf_{\widehat{M\theta}}\ \sup_{\theta \in B_d(1)} \|M\theta - \widehat{M\theta}\|_2^2 \ \gtrsim\ \frac{d\,s(\sigma)}{RL}\cdot\frac{\lambda_{\min}(M^\top M)}{\lambda_{\max}(M^\top M)},$$
where $M$ is assumed to be full rank and $\lambda_{\min}$ and $\lambda_{\max}$ denote the smallest and largest eigenvalues, respectively. The minimax lower bound for estimating $M\theta$ is not directly comparable to our result, where the goal is estimation of $\theta$. Nevertheless, using the operator norm inequality $\|M(\theta - \hat\theta)\|_2^2 \le \lambda_{\max}(M^\top M)\,\|\theta - \hat\theta\|_2^2$, we may conclude
$$\inf_{\hat\theta}\ \sup_{\theta \in B_d(1)} \|\theta - \hat\theta\|_2^2 \ \gtrsim\ \frac{d\,s(\sigma)}{RL}\cdot\frac{\lambda_{\min}(M^\top M)}{\lambda_{\max}(M^\top M)^2}.$$
A direct computation shows that (3) is sharper than the above $L_2$ minimax estimate, since
$$\frac{d\,\lambda_{\min}(M^\top M)}{\lambda_{\max}(M^\top M)^2} \ \le\ \frac{d}{\lambda_{\max}(M^\top M)} \ \le\ \mathrm{Tr}\big((M^\top M)^{-1}\big).$$
As for a general theory, apart from the Gaussian linear model, the minimax estimator for the GLM does not have a closed form, but the maximum likelihood estimator can be approximated by iterative weighted linear regression [5]. A variety of estimators such as aggregate estimators [6], robust estimators [7] and the Lasso for GLMs [8] have been proposed for various settings of the GLM. We refer interested readers to [9] for the theory of GLMs.

Another line of related work explores models with a stochastic design matrix $M$. Duchi, Jordan and Wainwright [10] consider inference of a parameter $\theta$ under privacy constraints. Negahban et al. [3] and Loh et al. [4] provide consistency and convergence rates for M-estimators in GLMs with low-dimensional structure under high-dimensional scaling.

Separate from the minimax problems considered here, model selection is another popular line of work. Model selection in linear regression dates back to the seventies and has regained popularity over the past decade, due to the increased need for data exploration with high-dimensional data; see [11]–[13] and many other works for the history. More recently, tools from model selection for linear regression have been adapted to the GLM; see [1] for a brief discussion.
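The comparison above only uses the elementary eigenvalue chain displayed in the last inequality; a quick numerical check of that chain (ours, on a random design) is below.

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(30, 5))
gram = M.T @ M
eigs = np.linalg.eigvalsh(gram)
lam_min, lam_max, d = eigs[0], eigs[-1], gram.shape[0]

lhs = d * lam_min / lam_max**2
mid = d / lam_max
rhs = np.trace(np.linalg.inv(gram))      # equals the sum of reciprocal eigenvalues

print(lhs <= mid <= rhs)                 # True: the chain used in the comparison above
print(lhs, mid, rhs)
```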
B. Organization

The remainder of this paper is organized as follows. Preliminaries for the derivation of our minimax lower bounds are introduced in Section II-A. The proof of Theorem 3 is given in Section II-B, with further remarks in Section II-C.

II. DERIVATION OF MINIMAX BOUND FOR THE GLM

The following notation is used throughout: upper-case letters (e.g., $X$, $Y$) denote random variables or matrices, and lower-case letters (e.g., $x$, $y$) denote realizations of random variables or vectors. We use subscript notation $v_i$ to denote the $i$-th component of a vector $v = (v_1, v_2, \dots, v_d)$, and we define the leave-one-out vector $v^{(j)} := (v_1, \dots, v_{j-1}, v_{j+1}, \dots, v_d)$.

A. Preliminaries

In the general framework of parametric statistics, let $(\mathcal{X}, \mathcal{F}, P_\theta;\ \theta \in \mathbb{R}^d)$ be a dominated family of probability measures on a measurable space $(\mathcal{X}, \mathcal{F})$ with dominating $\sigma$-finite measure $\lambda$. To each $P_\theta$, we associate a density $f(\cdot\,;\theta)$ with respect to $\lambda$ according to
$$dP_\theta(x) = f(x;\theta)\,d\lambda(x). \tag{5}$$
Assuming the maps $\theta \mapsto f(x;\theta)$, $x \in \mathcal{X}$, are differentiable, the Fisher information matrix associated to observation $X \sim f(\cdot\,;\theta)$ and parameter $\theta \in \mathbb{R}^d$ is defined as the matrix-valued map $\theta \mapsto I_X(\theta)$ with components
$$[I_X(\theta)]_{ij} = \mathbb{E}\left[\frac{\partial \log f(X;\theta)}{\partial \theta_i}\,\frac{\partial \log f(X;\theta)}{\partial \theta_j}\right], \qquad \theta \in \mathbb{R}^d.$$
Here and throughout, $\log$ denotes the natural logarithm. The following regularity assumption is standard when dealing with Fisher information.
Assumption 2.
The densities $f(\cdot\,;\theta)$ are sufficiently regular to permit the following exchange of integration and differentiation:
$$\int_{\mathcal{X}} \nabla_\theta f(x;\theta)\,d\lambda(x) = 0, \qquad \theta \in \mathbb{R}^d. \tag{6}$$
Here, $\nabla_\theta$ denotes the gradient with respect to $\theta$.

While the Fisher information is one notion of information that an observation $X \sim f(\cdot\,;\theta)$ reveals about the unknown parameter $\theta$, it also makes sense to consider the usual mutual information $I(X;\theta)$ under the further assumption that $\theta$ is distributed according to a known prior distribution $\pi$ (a probability measure on $\mathbb{R}^d$). Recent results by the authors together with Aras and Pananjady establish a quantitative relation between these two notions of information [14]. To state the result precisely, recall that a probability measure $d\mu = e^{-V}dx$ on $\mathbb{R}^d$ is said to be log-concave if the potential $V : \mathbb{R}^d \to \mathbb{R}$ is a convex function.

Lemma 8 ([14, Theorem 2]). Let $\theta \sim \pi$, where $\pi$ is log-concave on $\mathbb{R}^d$, and given $\theta$ let $X \sim f(\cdot\,;\theta)$. If Assumption 2 holds, then
$$I(X;\theta) \ \le\ d\cdot\phi\!\left(\frac{\mathrm{Tr}(\mathrm{Cov}(\theta))\cdot\mathrm{Tr}(\mathbb{E}\,I_X(\theta))}{d^2}\right), \tag{7}$$
where
$$\phi(x) := \begin{cases} \sqrt{x} & \text{if } 0 \le x \le 1,\\ 1 + \log x & \text{if } x \ge 1.\end{cases}$$

As discussed extensively in [14], the above result is related to the van Trees inequality [15], [16], and its entropic improvement due to Efroimovich [17]. The crucial feature of (7) compared to these other results is that it does not depend on the (information theorist's version of) Fisher information of the prior $\pi$, commonly denoted $J(\pi)$. This is what is gained via the assumption of log-concavity, and is important for our analysis, where we introduce (log-concave) priors with arbitrarily large Fisher information.
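To give a feel for the bound (7), the sketch below (ours) evaluates it for a Gaussian location channel with a Gaussian, hence log-concave, prior, for which the mutual information is available in closed form; the prior and noise levels are arbitrary choices of ours.

```python
import numpy as np

def phi(x):
    """phi(x) = sqrt(x) on [0, 1] and 1 + log(x) for x >= 1, as in Lemma 8."""
    return np.sqrt(x) if x <= 1.0 else 1.0 + np.log(x)

# Gaussian location channel as a test case (our choice, satisfying the assumptions):
# theta ~ N(0, tau^2 I_d) is log-concave, and X = theta + N(0, nu^2 I_d).
# Then Tr(Cov(theta)) = d*tau^2, Tr(E I_X(theta)) = d/nu^2, and the mutual
# information is known exactly: I(X;theta) = (d/2) * log(1 + tau^2/nu^2).
d, tau2, nu2 = 5, 3.0, 0.5
exact_mi = 0.5 * d * np.log(1.0 + tau2 / nu2)
bound = d * phi((d * tau2) * (d / nu2) / d**2)   # right-hand side of (7)

print("I(X;theta) =", exact_mi, "  bound (7) =", bound)   # the bound dominates
```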
B. Proof of Theorem 3

Recall that the design matrix $M$ has as its rows $\{m_i\}_{i=1}^n \subset \mathbb{R}^d$. Writing the matrix $M$ in terms of its SVD $M = U\Sigma V^\top$ and defining $u_i$ as the $i$-th column of the matrix $U^\top$, we have
$$\langle m_i, \theta\rangle = \langle \Sigma u_i, V^\top\theta\rangle = \langle \bar m_i, \bar\theta\rangle, \tag{8}$$
where we defined the variables $\bar m_i := \Sigma u_i$ and $\bar\theta := V^\top\theta$. Since $V$ is an orthogonal matrix by definition, it follows by rotation invariance of the $L_2$ ball $B_d(1)$ that the estimation problem can be equivalently formulated under the reparametrization $(\theta, M) \longrightarrow (\bar\theta, \bar M)$, where $\bar M := MV = U\Sigma$. More specifically, the minimax risk for estimating $\theta \in B_d(1)$ is equal to the minimax risk for estimating $\bar\theta \in B_d(1)$. More precisely,
$$\inf_{\hat\theta}\ \sup_{\theta\in B_d(1)} \mathbb{E}\,\|\theta - \hat\theta\|_2^2 \ =\ \inf_{\hat{\bar\theta}}\ \sup_{\bar\theta\in B_d(1)} \mathbb{E}\,\|\bar\theta - \hat{\bar\theta}\|_2^2.$$
As a result, we may assume without loss of generality that $M^\top M$ is a diagonal matrix.

By definition, the minimax risk is lower bounded by the Bayes risk when $\theta$ is assumed to be distributed according to a prior $\pi$ defined on the $L_2$ ball $B_d(1)$. Hence, our task is to judiciously select a prior $\pi$ that yields the desired lower bound. Toward this end, we will let $\pi$ be the uniform measure on the rectangle $\prod_{i=1}^d [-\epsilon_i/2, \epsilon_i/2]$ for values $(\epsilon_i)_{i=1,2,\dots,d}$ to be determined below, satisfying
$$\sum_{i=1}^d \epsilon_i^2 \ \le\ 4. \tag{9}$$
In other words, our construction implies $\theta$ has independent components, with the $i$-th coordinate $\theta_i$ uniform on the interval $[-\epsilon_i/2, \epsilon_i/2]$. The interval lengths will, in general, be chosen to exploit the structure of the design matrix $M$.

We now describe our construction of the sequence $(\epsilon_i)_{i=1,2,\dots,d}$. We start with the simple case, in which the matrix $M$ does not have full (column) rank. In this case, there exists an eigenvalue $\lambda_k(M^\top M) = 0$. For this index $k$, we set $\epsilon_i = 2\delta_{ik}$, $i = 1, 2, \dots, d$, where $\delta_{ij}$ is the Kronecker delta function. Now, we may bound
$$\mathbb{E}\,\|\theta - \hat\theta\|_2^2 \ \ge\ \mathrm{Var}(\theta_k - \hat\theta_k) \ \overset{(a)}{\ge}\ \frac{1}{2\pi e}\,e^{2h(\theta_k - \hat\theta_k)} \ \overset{(b)}{\ge}\ \frac{1}{2\pi e}\,e^{2h(\theta_k\mid\hat\theta_k)} \ =\ \frac{1}{2\pi e}\,e^{2h(\theta_k) - 2I(\hat\theta_k;\theta_k)} \ \overset{(c)}{\ge}\ \frac{1}{2\pi e}\,e^{2h(\theta_k) - 2I(X;\theta_k)} \ \overset{(d)}{=}\ \frac{2}{\pi e},$$
where (a) follows from the max-entropy property of Gaussians; (b) follows since conditioning reduces entropy: $h(\theta_k - \hat\theta_k) \ge h(\theta_k - \hat\theta_k\mid\hat\theta_k) = h(\theta_k\mid\hat\theta_k)$; (c) follows from the data processing inequality since $\theta_k \to X \to \hat\theta_k$ forms a Markov chain; and (d) follows since $\theta_k \sim \mathrm{Unif}(-1, 1)$ and $I(X;\theta_k) = 0$, since $\pi$ is supported in the kernel of $M$ by construction.

Having shown the minimax risk is lower bounded by a constant when $M$ does not have full (column) rank, we assume henceforth that $M$ has full rank.

Note that under our assumptions, the pair $(X, \theta)$ has a joint distribution, and therefore so does the pair $(X, \theta_i)$. Consistent with the previously introduced notation, we write $I_X(\theta_i)$ to denote the Fisher information of $X$ drawn according to the conditional law of $X$ given $\theta_i$. With this notation in hand, the next lemma provides a comparison between the expected Fisher information conditioned on a single component $\theta_i$ of the parameter $\theta$ and the $i$-th diagonal entry of the expected Fisher information matrix for the parameter $\theta$.

Lemma 9.
When the components of the parameter $\theta \sim \pi$, $\theta \in \mathbb{R}^d$, are independent and $X \sim f(\cdot\,;\theta)$ is generated by the GLM (2), we have
$$\mathbb{E}\,[I_X(\theta)]_{ii} \ \ge\ \mathbb{E}\,I_X(\theta_i), \qquad i = 1, 2, \dots, d.$$

Proof.
The desired estimate is obtained by observing
$$\mathbb{E}\,[I_X(\theta)]_{ii} = \mathbb{E}\left[\frac{\big(\tfrac{\partial}{\partial\theta_i} f(X;\theta)\big)^2}{f(X;\theta)^2}\right] \ \overset{(a)}{\ge}\ \mathbb{E}\left[\frac{\big(\mathbb{E}\big[\tfrac{\partial}{\partial\theta_i} f(X;\theta)\,\big|\,\theta_i, X\big]\big)^2}{\big(\mathbb{E}[f(X;\theta)\mid\theta_i, X]\big)^2}\right] \ \overset{(b)}{=}\ \mathbb{E}\left[\frac{\big(\tfrac{\partial}{\partial\theta_i}\mathbb{E}[f(X;\theta)\mid\theta_i, X]\big)^2}{\big(\mathbb{E}[f(X;\theta)\mid\theta_i, X]\big)^2}\right] \ =\ \mathbb{E}\,I_X(\theta_i).$$
In the above, (a) is due to Cauchy-Schwarz. Indeed, let $\pi_i$ and $\pi^{(i)}$ denote the marginal laws of $\theta_i$ and $\theta^{(i)}$, respectively. Using independence of $\theta_i$ and $\theta^{(i)}$, note that
$$\mathbb{E}\left[\frac{\big(\tfrac{\partial}{\partial\theta_i} f(X;\theta)\big)^2}{f(X;\theta)^2}\right] = \int_{\mathbb{R}}\int_{\mathcal{X}}\int_{\mathbb{R}^{d-1}} \frac{\big(\tfrac{\partial}{\partial\theta_i} f(x;\theta)\big)^2}{f(x;\theta)}\,d\pi^{(i)}(\theta^{(i)})\,d\lambda(x)\,d\pi_i(\theta_i) \ \ge\ \int_{\mathbb{R}}\int_{\mathcal{X}} \frac{\Big(\int_{\mathbb{R}^{d-1}} \tfrac{\partial}{\partial\theta_i} f(x;\theta)\,d\pi^{(i)}(\theta^{(i)})\Big)^2}{\int_{\mathbb{R}^{d-1}} f(x;\theta)\,d\pi^{(i)}(\theta^{(i)})}\,d\lambda(x)\,d\pi_i(\theta_i) = \mathbb{E}\left[\frac{\big(\mathbb{E}\big[\tfrac{\partial}{\partial\theta_i} f(X;\theta)\,\big|\,\theta_i, X\big]\big)^2}{\big(\mathbb{E}[f(X;\theta)\mid\theta_i, X]\big)^2}\right],$$
where the last line follows since
$$x \mapsto \mathbb{E}[f(X;\theta)\mid\theta_i, X = x] = \int_{\mathbb{R}^{d-1}} f(x;\theta)\,d\pi^{(i)}(\theta^{(i)})$$
is the density (w.r.t. $\lambda$) of $X$ given $\theta_i$. Equality (b) follows from independence between $\theta_i$ and $\theta^{(i)}$ and the Leibniz integral rule. Application of the latter can be justified by the assumed regularity of $\Phi$ and compactness of $B_d(1)$.

Next, fix $\epsilon_i > 0$. Since $\theta_i \sim \mathrm{Unif}(-\epsilon_i/2, \epsilon_i/2)$ has a log-concave distribution, and the GLM (2) satisfies Assumption 2 (a consequence of Assumption 1 and the Leibniz integral rule, justified by regularity of $\Phi$), we can apply Lemmas 8 and 9 to conclude
$$e^{2h(\theta_i\mid\hat\theta_i)} \ \ge\ e^{2h(\theta_i) - 2I(X;\theta_i)} \ \ge\ e^{2h(\theta_i) - 2\phi(\mathrm{Var}(\theta_i)\cdot\mathbb{E}\,I_X(\theta_i))} \ \ge\ \epsilon_i^2\, e^{-2\phi\left(\epsilon_i^2\,\mathbb{E}[I_X(\theta)]_{ii}\right)}. \tag{10}$$
Note that the last inequality used the identities $\mathrm{Var}(\theta_i) = \epsilon_i^2/12 \le \epsilon_i^2$ and $h(\theta_i) = \log(\epsilon_i)$, holding by construction, together with monotonicity of $\phi$ and Lemma 9.

Next, recall the following well-known identities associated with exponential families of the form we consider.

Lemma 10 ([9, Page 29]). Fix $m$ and $\theta$, and consider a density $f(x;\theta) = h(x)\exp\left(\frac{x\langle m, \theta\rangle - \Phi(\langle m, \theta\rangle)}{s(\sigma)}\right)$ with respect to $\lambda$. A random observation $X \sim f(\cdot\,;\theta)$ has mean $\Phi'(\langle m, \theta\rangle)$ and variance $s(\sigma)\cdot\Phi''(\langle m, \theta\rangle)$.

Combining the above with our assumption that $\Phi'' \le L$, we have for any $\theta \in \mathbb{R}^d$,
$$[I_X(\theta)]_{ii} = \mathbb{E}_{X\sim f(\cdot;\theta)}\left(\frac{\partial}{\partial\theta_i}\log f(X;\theta)\right)^2 = \frac{1}{s(\sigma)^2}\,\mathbb{E}_{X\sim f(\cdot;\theta)}\left(\sum_{j=1}^n M_{ji}\big(X_j - \Phi'(\langle m_j, \theta\rangle)\big)\right)^2 = \frac{1}{s(\sigma)^2}\sum_{j=1}^n M_{ji}^2\,\mathrm{Var}(X_j) \ \le\ \frac{1}{s(\sigma)^2}\sum_{j=1}^n M_{ji}^2\, L\, s(\sigma) = \frac{L}{s(\sigma)}\,[M^\top M]_{ii}. \tag{11}$$
Putting (10) and (11) together, we conclude for any choice of $\epsilon_i > 0$,
$$e^{2h(\theta_i\mid\hat\theta_i)} \ \ge\ \epsilon_i^2\,\exp\!\left[-2\phi\!\left(\epsilon_i^2\,\frac{L}{s(\sigma)}\,[M^\top M]_{ii}\right)\right]. \tag{12}$$
In case $\epsilon_i = 0$, we have the trivial equality $e^{2h(\theta_i\mid\hat\theta_i)} = 0$, which is consistent with the RHS of (12) evaluated at $\epsilon_i = 0$. Hence, the estimate (12) holds for all $\epsilon_i \ge 0$.

Summing (12) over $i = 1, 2, \dots, d$, for parameter $\theta \sim \pi = \prod_{i=1}^d \mathrm{Unif}(-\epsilon_i/2, \epsilon_i/2)$ and any measurable function $\hat\theta$ of $X \sim f(\cdot\,;\theta)$, we have the following lower bound on the Bayesian $L_2$ risk:
$$\mathbb{E}\,\|\theta - \hat\theta\|_2^2 \ \ge\ \sum_{i=1}^d \mathrm{Var}(\theta_i - \hat\theta_i) \ \ge\ \frac{1}{2\pi e}\sum_{i=1}^d e^{2h(\theta_i\mid\hat\theta_i)} \ \ge\ \frac{1}{2\pi e}\sum_{i=1}^d \epsilon_i^2\,\exp\!\left[-2\phi\!\left(\epsilon_i^2\,\frac{L}{s(\sigma)}\,[M^\top M]_{ii}\right)\right]. \tag{13}$$
It remains to choose an appropriate sequence $(\epsilon_i)_{i=1,2,\dots,d}$ to obtain the desired lower bound. Toward this end, we consider two cases.
Case 1: $\mathrm{Tr}((M^\top M)^{-1}) \le 4L/s(\sigma)$.

In this case, we choose
$$\epsilon_i^2 = \frac{1}{2}\,\frac{s(\sigma)}{L}\,\big([M^\top M]_{ii}\big)^{-1}, \qquad i = 1, 2, \dots, d.$$
Note that by our assumption that $M^\top M$ is diagonal,
$$\sum_{i=1}^d \epsilon_i^2 = \frac{1}{2}\,\frac{s(\sigma)}{L}\,\mathrm{Tr}\big((M^\top M)^{-1}\big) \le 2,$$
so that (9) is satisfied. By an application of (13), we have
$$\mathbb{E}\,\|\theta - \hat\theta\|_2^2 \ \gtrsim\ \sum_{i=1}^d \epsilon_i^2\,\exp\!\left[-2\phi\!\left(\epsilon_i^2\,\frac{L}{s(\sigma)}\,[M^\top M]_{ii}\right)\right] = e^{-2\phi(1/2)}\cdot\frac{1}{2}\,\frac{s(\sigma)}{L}\sum_{i=1}^d \big([M^\top M]_{ii}\big)^{-1} \ \gtrsim\ \frac{s(\sigma)}{L}\,\mathrm{Tr}\big((M^\top M)^{-1}\big).$$
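The Case 1 choice can be checked numerically. The sketch below (ours, on a random full-rank design that happens to satisfy the Case 1 condition) verifies (9) and evaluates the sum appearing in (13):

```python
import numpy as np

def phi(x):
    return np.sqrt(x) if x <= 1.0 else 1.0 + np.log(x)

rng = np.random.default_rng(5)
s_sigma, L = 1.0, 2.0
M = rng.normal(size=(50, 4))
lam = np.linalg.eigvalsh(M.T @ M)            # after the SVD reduction, [M^T M]_ii = lambda_i
assert np.sum(1.0 / lam) <= 4 * L / s_sigma  # Case 1 condition for this design

eps_sq = 0.5 * (s_sigma / L) / lam           # the Case 1 choice of eps_i^2
print("sum of eps_i^2:", eps_sq.sum(), "(must be <= 4 for (9))")

# Sum appearing in (13) with this choice (the 1/(2*pi*e) factor is omitted); it is a
# constant multiple of (s(sigma)/L) * Tr((M^T M)^{-1}), as in the display above.
value = sum(e2 * np.exp(-2 * phi(e2 * (L / s_sigma) * lv)) for e2, lv in zip(eps_sq, lam))
target = (s_sigma / L) * np.sum(1.0 / lam)
print("sum in (13):", value, "  (s/L)*Tr((M^T M)^{-1}):", target)
```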
Case 2: $\mathrm{Tr}((M^\top M)^{-1}) > 4L/s(\sigma)$.

This case is the more difficult of the two. We shall make use of the following technical lemma.

Lemma 11.
Let $(a_i)_{i=1,2,\dots,d}$ be any positive sequence satisfying $\sum_{i=1}^d a_i^{-1} > 4$. Then, there exists a non-negative sequence $(\epsilon_i)_{i=1,2,\dots,d}$ such that $\sum_{i=1}^d \epsilon_i^2 \le 4$ and $\sum_{i=1}^d \epsilon_i^2\, e^{-2\phi(\epsilon_i^2 a_i)} \ge e^{-2}$.

Proof. Without loss of generality, assume that $a_1 \ge a_2 \ge \cdots \ge a_d > 0$. If $a_1 \le 1/4$, then taking $(\epsilon_1, \epsilon_2, \dots, \epsilon_d) = (2, 0, 0, \dots, 0)$, and noticing that $\phi$ is an increasing function, we conclude
$$\sum_{i=1}^d \epsilon_i^2\, e^{-2\phi(\epsilon_i^2 a_i)} = 4 e^{-2\phi(4a_1)} \ \ge\ 4 e^{-2\phi(1)} = 4e^{-2} > e^{-2}.$$
Now, in the following we assume that $a_1 > 1/4$. Let $t$ denote the largest integer $k \in \{1, 2, \dots, d\}$ satisfying $\sum_{i=1}^k a_i^{-1} \le 4$. Since $a_1^{-1} < 4$ and, by assumption, $\sum_{i=1}^d a_i^{-1} > 4$, such a $t$ always exists and satisfies $t < d$. We set
$$\epsilon_i = \begin{cases} a_i^{-1/2} & \text{if } 1 \le i \le t,\\ 0 & \text{otherwise,}\end{cases} \qquad i = 1, 2, \dots, d. \tag{14}$$
By definition, $\sum_{i=1}^d \epsilon_i^2 = \sum_{i=1}^t a_i^{-1} \le 4$ satisfies (9). This procedure results in
$$\sum_{i=1}^d \epsilon_i^2\, e^{-2\phi(\epsilon_i^2 a_i)} = e^{-2}\sum_{i=1}^t a_i^{-1}.$$
If $\sum_{i=1}^t a_i^{-1} \ge 1$, we can immediately see from the above and (14) that $\sum_{i=1}^d \epsilon_i^2\, e^{-2\phi(\epsilon_i^2 a_i)} \ge e^{-2}$. On the other hand, if $\sum_{i=1}^t a_i^{-1} < 1$, then maximality of $t$ implies that $a_{t+1}^{-1} > 3$. In this case, we simply take $\epsilon_{t+1} = 2$, and take $\epsilon_i = 0$ for $i \neq t+1$. With this choice, we have
$$\sum_{i=1}^d \epsilon_i^2\, e^{-2\phi(\epsilon_i^2 a_i)} = 4 e^{-2\phi(4a_{t+1})} \ \ge\ 4 e^{-2\phi(4/3)} = \frac{9}{4}\,e^{-2} > e^{-2}.$$
The above discussion concludes the proof of Lemma 11.

By considering the values $a_i = \frac{L}{s(\sigma)}\,[M^\top M]_{ii}$, Lemma 11 ensures the existence of $(\epsilon_i)_{i=1,2,\dots,d}$ satisfying (9) which, together with (13), gives $\mathbb{E}\,\|\theta - \hat\theta\|_2^2 \gtrsim 1$. This completes the proof of Theorem 3.
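The construction in the proof of Lemma 11 is easy to implement directly. The sketch below (ours, on an arbitrary admissible sequence $(a_i)$) follows the three branches of the proof and checks both conclusions:

```python
import numpy as np

def phi(x):
    x = np.asarray(x, dtype=float)
    return np.where(x <= 1.0, np.sqrt(x), 1.0 + np.log(np.maximum(x, 1.0)))

def lemma11_eps_squared(a):
    """Given a positive non-increasing sequence a with sum(1/a) > 4, return eps_i^2
    following the three branches of the proof of Lemma 11."""
    a = np.asarray(a, dtype=float)
    eps_sq = np.zeros(len(a))
    if a[0] <= 0.25:                 # first branch: eps_1 = 2, all other coordinates zero
        eps_sq[0] = 4.0
        return eps_sq
    cumsum = np.cumsum(1.0 / a)
    t = int(np.searchsorted(cumsum, 4.0, side="right"))  # largest t with sum_{i<=t} 1/a_i <= 4
    if cumsum[t - 1] >= 1.0:
        eps_sq[:t] = 1.0 / a[:t]     # second branch: eps_i = a_i^{-1/2} for i <= t
    else:
        eps_sq[t] = 4.0              # third branch: here 1/a_{t+1} > 3, so eps_{t+1} = 2 works
    return eps_sq

a = np.array([2.0, 1.5, 0.8, 0.5, 0.3, 0.2])      # non-increasing, sum(1/a) > 4
e2 = lemma11_eps_squared(a)
value = float(np.sum(e2 * np.exp(-2.0 * phi(e2 * a))))

print("sum eps_i^2 =", e2.sum(), "(<= 4)")
print("objective   =", value, ">= e^{-2} =", np.exp(-2.0))
```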
C. Remarks

A few remarks are in order. First, we note that the argument in the previous subsection actually yields the stronger entropic inequality
$$\inf_{\hat\theta}\ \sup_{\pi}\ \sum_{i=1}^d e^{2h(\theta_i\mid\hat\theta_i)} \ \gtrsim\ \min\left(\frac{s(\sigma)}{L}\,\mathrm{Tr}\big((M^\top M)^{-1}\big),\ 1\right),$$
which improves Theorem 3 (seen by the max-entropy property of Gaussians). Here, the supremum is taken over all distributions $\pi$ supported on the $L_2$ ball $B_d(1)$.

Second, we remark that our analysis is flexible enough for generalizations to other forms of the GLM. For example, consider an observation $X$ drawn from the density
$$f(x;\theta) = \prod_{i=1}^n \left\{ h_i(x_i)\exp\left(\frac{x_i \langle m_i, \theta\rangle - \Phi_i(\langle m_i, \theta\rangle)}{s_i(\sigma)}\right)\right\}.$$
Suppose Assumption 1 holds for each cumulant function $\Phi_i$ (i.e., $\Phi_i'' \le L$ for each $i = 1, \dots, n$). Then, a slight modification of (11) yields
$$[I_X(\theta)]_{ii} \ \le\ \frac{L}{s_*(\sigma)}\,[M^\top M]_{ii},$$
where $s_*(\sigma) = \min_{i=1,2,\dots,n} s_i(\sigma)$. Following (13) and the same choice of $(\epsilon_i)_{i=1,2,\dots,d}$ as in Section II-B, with the argument $s(\sigma)$ replaced by $s_*(\sigma)$, we obtain the minimax lower bound
$$\inf_{\hat\theta}\ \sup_{\theta \in B_d(1)} \mathbb{E}\,\|\theta - \hat\theta\|_2^2 \ \gtrsim\ \min\left(\frac{s_*(\sigma)}{L}\,\mathrm{Tr}\big((M^\top M)^{-1}\big),\ 1\right).$$
In the special case where $s_1(\sigma) = \dots = s_n(\sigma)$, the same minimax lower bound as Theorem 3 is recovered.

ACKNOWLEDGEMENTS
The authors thank Ashwin Pananjady for useful discussions. This work was supported in part by NSF grants CCF-1704967, CCF-0939370 and CCF-1750430.
REFERENCES

[1] F. Abramovich and V. Grinshtein, "Model Selection and Minimax Estimation in Generalized Linear Models," IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3721–3730, 2016.
[2] H.-G. Müller and U. Stadtmüller, "Generalized Functional Linear Models," The Annals of Statistics, vol. 33, no. 2, pp. 774–805, 2005.
[3] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu, "A Unified Framework for High-Dimensional Analysis of M-estimators with Decomposable Regularizers," Statistical Science, vol. 27, no. 4, pp. 538–557, 2012.
[4] P.-L. Loh and M. J. Wainwright, "Regularized M-estimators with Nonconvexity: Statistical and Algorithmic Theory for Local Optima," The Journal of Machine Learning Research, vol. 16, no. 1, pp. 559–616, 2015.
[5] J. A. Nelder and R. W. Wedderburn, "Generalized Linear Models," Journal of the Royal Statistical Society: Series A (General), vol. 135, no. 3, pp. 370–384, 1972.
[6] P. Rigollet, "Kullback–Leibler Aggregation and Misspecified Generalized Linear Models," The Annals of Statistics, vol. 40, no. 2, pp. 639–665, 2012.
[7] E. Cantoni and E. Ronchetti, "Robust Inference for Generalized Linear Models," Journal of the American Statistical Association, vol. 96, no. 455, pp. 1022–1030, 2001.
[8] S. A. Van de Geer, "High-Dimensional Generalized Linear Models and the Lasso," The Annals of Statistics, vol. 36, no. 2, pp. 614–645, 2008.
[9] P. McCullagh, Generalized Linear Models. Routledge, 2019.
[10] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, "Minimax Optimal Procedures for Locally Private Estimation," Journal of the American Statistical Association, vol. 113, no. 521, pp. 182–201, 2018.
[11] H. Akaike, "Information Theory and An Extension of the Maximum Likelihood Principle," in Selected Papers of Hirotugu Akaike. Springer, 1998, pp. 199–213.
[12] L. Birgé and P. Massart, "Minimal Penalties for Gaussian Model Selection," Probability Theory and Related Fields, vol. 138, no. 1-2, pp. 33–73, 2007.
[13] N. Verzelen, "Minimax Risks for Sparse Regressions: Ultra-High Dimensional Phenomenons," Electronic Journal of Statistics, vol. 6, pp. 38–90, 2012.
[14] E. Aras, K.-Y. Lee, A. Pananjady, and T. A. Courtade, "A Family of Bayesian Cramér-Rao Bounds, and Consequences for Log-Concave Priors," ISIT, 2019.
[15] R. D. Gill and B. Y. Levit, "Applications of the van Trees Inequality: a Bayesian Cramér-Rao Bound," Bernoulli, vol. 1, no. 1-2, pp. 59–79, 1995.
[16] H. L. Van Trees, Detection, Estimation, and Modulation Theory, Part I: Detection, Estimation, and Linear Modulation Theory. John Wiley & Sons, 1968.
[17] S. Y. Efroimovich, "Information Contained in a Sequence of Observations,"