Strong Brascamp-Lieb Inequalities

Abstract

In this paper, we derive sharp nonlinear dimension-free Brascamp-Lieb inequalities (including hypercontractivity inequalities) for distributions on Polish spaces, which strengthen the classic Brascamp-Lieb inequalities. Applications include the extension of Mr. and Mrs. Gerber's lemmas to the cases of Rényi divergences and distributions on Polish spaces, the strengthening of small-set expansion theorems, and the characterization of the exponent of $q$-stability of Boolean functions. Our proofs in this paper are based on information-theoretic and coupling techniques.


Lei Yu

Index Terms

Brascamp-Lieb, hypercontractivity, Mr. and Mrs. Gerber's lemmas, isoperimetric inequalities, small-set expansion, noise stability

I. INTRODUCTION

Let $\mathcal{X}$ and $\mathcal{Y}$ be two Polish spaces, and $\mathbb{B}_{\mathcal{X}}$ and $\mathbb{B}_{\mathcal{Y}}$ the Borel $\sigma$-algebras on $\mathcal{X}$ and $\mathcal{Y}$, respectively. Let $(\mathcal{X}\times\mathcal{Y}, \mathbb{B}_{\mathcal{X}}\otimes\mathbb{B}_{\mathcal{Y}})$ be the product of the measurable spaces $(\mathcal{X},\mathbb{B}_{\mathcal{X}})$ and $(\mathcal{Y},\mathbb{B}_{\mathcal{Y}})$, and $P_{XY}$ a probability measure (also termed distribution) on $(\mathcal{X}\times\mathcal{Y}, \mathbb{B}_{\mathcal{X}}\otimes\mathbb{B}_{\mathcal{Y}})$. Let $(X,Y)\sim P_{XY}$. The (forward and reverse) Brascamp-Lieb (BL) inequalities are as follows: given the distribution $P_{XY}$ and $p,q\in[-\infty,\infty]$, for any nonnegative measurable functions $f:\mathcal{X}\to\mathbb{R}_{\ge 0}$, $g:\mathcal{Y}\to\mathbb{R}_{\ge 0}$,

$$\mathbb{E}[f(X)g(Y)] \le e^{-\underline{C}}\,\|f(X)\|_p\,\|g(Y)\|_q \tag{1}$$

$$\mathbb{E}[f(X)g(Y)] \ge e^{-\overline{C}}\,\|f(X)\|_p\,\|g(Y)\|_q, \tag{2}$$

where $\underline{C}=\underline{C}_{p,q}$ and $\overline{C}=\overline{C}_{p,q}$ depend only on $p,q$ given the distribution $P_{XY}$. We denote the optimal exponents $\underline{C}_{p,q}$ and $\overline{C}_{p,q}$ for the distribution $P_{XY}$ as $\underline{C}^*_{p,q}$ and $\overline{C}^*_{p,q}$.

Note that BL inequalities are homogeneous, i.e., (1) and (2) still hold under the substitution $f\leftarrow af$, $g\leftarrow bg$ for any positive constants $a,b$. Hence, without changing the optimal exponents, we can normalize both sides of the BL inequalities in (1) and (2) by $\|f(X)\|_{\hat p}\,\|g(Y)\|_{\hat q}$ for reals $\hat p\neq p$, $\hat q\neq q$. The resulting normalized version of the BL inequalities can be understood as a linear tradeoff between the correlation $\frac{\mathbb{E}[f(X)g(Y)]}{\|f(X)\|_{\hat p}\|g(Y)\|_{\hat q}}$ and the product $\frac{\|f(X)\|_p}{\|f(X)\|_{\hat p}}\cdot\frac{\|g(Y)\|_q}{\|g(Y)\|_{\hat q}}$. In this paper, we aim at studying the optimal nonlinear tradeoff among $\frac{\mathbb{E}[f(X)g(Y)]}{\|f(X)\|_{\hat p}\|g(Y)\|_{\hat q}}$, $\frac{\|f(X)\|_p}{\|f(X)\|_{\hat p}}$, and $\frac{\|g(Y)\|_q}{\|g(Y)\|_{\hat q}}$.
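The homogeneity property above can be checked numerically on a finite alphabet. The following is a minimal sketch, not part of the original development: the joint distribution (a doubly symmetric binary source), the functions `f`, `g`, the scaling constants, and all helper names are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions throughout): on a finite alphabet,
# the ratio E[f(X)g(Y)] / (||f(X)||_p ||g(Y)||_q) is unchanged when f and g
# are rescaled by positive constants.

def marginals(Pxy):
    """Return the X- and Y-marginals of a joint pmf stored as Pxy[x][y]."""
    Px = [sum(row) for row in Pxy]
    Py = [sum(Pxy[x][y] for x in range(len(Pxy))) for y in range(len(Pxy[0]))]
    return Px, Py

def lp_norm(f, P, p):
    """||f(X)||_p = (E[f(X)^p])^(1/p) for X ~ P, with p != 0 and f > 0."""
    return sum(P[x] * f[x] ** p for x in range(len(P))) ** (1.0 / p)

def correlation(f, g, Pxy):
    """E[f(X) g(Y)] for (X, Y) ~ Pxy."""
    return sum(Pxy[x][y] * f[x] * g[y]
               for x in range(len(f)) for y in range(len(g)))

def bl_ratio(f, g, Pxy, p, q):
    """Normalized correlation E[fg] / (||f||_p ||g||_q)."""
    Px, Py = marginals(Pxy)
    return correlation(f, g, Pxy) / (lp_norm(f, Px, p) * lp_norm(g, Py, q))

# Hypothetical example: binary alphabets, crossover probability 0.1.
Pxy = [[0.45, 0.05], [0.05, 0.45]]
f = [1.0, 3.0]
g = [2.0, 0.5]

ratio = bl_ratio(f, g, Pxy, 2.0, 3.0)
ratio_scaled = bl_ratio([5.0 * v for v in f], [0.2 * v for v in g], Pxy, 2.0, 3.0)
```

Here `ratio` and `ratio_scaled` agree up to floating-point error, which is why the exponents in (1) and (2) do not depend on the scale of $f$ and $g$.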
To this end, for $p,\hat p\in\mathbb{R}\setminus\{0\}$ such that $p\neq\hat p$, we define the $(p,\hat p)$-Rényi entropy of a nonnegative function $f$ as

$$H_{p,\hat p}(f) := \frac{p\hat p}{p-\hat p}\,\log\frac{\|f(X)\|_p}{\|f(X)\|_{\hat p}}. \tag{3}$$

The $(p,\hat p)$-Rényi entropy for the other cases is defined by continuous extension. Specifically, for $p\in\mathbb{R}\setminus\{0\}$, the $(p,p)$-Rényi entropy of $f$ is defined as

$$H_{p,p}(f) := H(f^p) := \frac{\mathbb{E}_P[f^p\log f^p]-\mathbb{E}_P[f^p]\log\mathbb{E}_P[f^p]}{\mathbb{E}_P[f^p]}, \tag{4}$$

which coincides with the Shannon entropy of $f^p$. Note that the Shannon entropy here is slightly different from the usual one; the usual definition of Shannon entropy corresponds to the numerator on the RHS of (4). For $p\in\mathbb{R}$, the $(p,0)$-Rényi entropy and the $(0,p)$-Rényi entropy of $f$ are defined by

$$H_{p,0}(f),\ H_{0,p}(f) := -\log P_X(\operatorname{supp}(f)).$$

L. Yu is with the School of Statistics and Data Science, Nankai University, Tianjin 300071, China (e-mail: [email protected]). arXiv preprint [math.FA].

Given $p,q,\hat p,\hat q\in\mathbb{R}$ and $P_{XY}$, we define the optimal forward and reverse BL exponents as

$$\underline{\Delta}_{p,q,\hat p,\hat q}(\alpha,\beta\,|\,P_{XY}) := -\sup_{f\in\mathcal{F}_\alpha,\,g\in\mathcal{G}_\beta}\log\frac{\mathbb{E}[f(X)g(Y)]}{\|f(X)\|_{\hat p}\|g(Y)\|_{\hat q}} \tag{5}$$

$$\overline{\Delta}_{p,q,\hat p,\hat q}(\alpha,\beta\,|\,P_{XY}) := -\inf_{f\in\mathcal{F}_\alpha,\,g\in\mathcal{G}_\beta}\log\frac{\mathbb{E}[f(X)g(Y)]}{\|f(X)\|_{\hat p}\|g(Y)\|_{\hat q}}, \tag{6}$$

where $\mathcal{F}_\alpha := \{f:\mathcal{X}\to\mathbb{R}_{\ge 0} : H_{p,\hat p}(f)=\alpha\}$ and $\mathcal{G}_\beta := \{g:\mathcal{Y}\to\mathbb{R}_{\ge 0} : H_{q,\hat q}(g)=\beta\}$. Since the $(p,\hat p)$-Rényi entropy is symmetric w.r.t. $(p,\hat p)$, it is easy to verify the following lemma.

Lemma 1.

Both $\underline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta\,|\,P_{XY}) := \underline{\Delta}_{p,q,\hat p,\hat q}(\alpha,\beta\,|\,P_{XY})+\frac{\alpha}{\hat p}+\frac{\beta}{\hat q}$ and $\overline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta\,|\,P_{XY}) := \overline{\Delta}_{p,q,\hat p,\hat q}(\alpha,\beta\,|\,P_{XY})+\frac{\alpha}{\hat p}+\frac{\beta}{\hat q}$ are symmetric w.r.t. $(p,\hat p)$ and also symmetric w.r.t. $(q,\hat q)$.

Instead of $\underline{\Delta}_{p,q,\hat p,\hat q}(\alpha,\beta\,|\,P_{XY})$ and $\overline{\Delta}_{p,q,\hat p,\hat q}(\alpha,\beta\,|\,P_{XY})$, in this paper we equivalently aim at characterizing $\underline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta\,|\,P_{XY})$ and $\overline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta\,|\,P_{XY})$. For brevity, we sometimes omit the underlying distribution $P_{XY}$ in these notations, and write them as $\underline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta)$ and $\overline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta)$.

Let $P^n_{XY}$ be the product of $n$ copies of $P_{XY}$. For $P^n_{XY}$, we define the forward and reverse $n$-BL exponents respectively as

$$\underline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta) := \frac{1}{n}\,\underline{\Lambda}_{p,q,\hat p,\hat q}(n\alpha,n\beta\,|\,P^n_{XY}),\qquad \overline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta) := \frac{1}{n}\,\overline{\Lambda}_{p,q,\hat p,\hat q}(n\alpha,n\beta\,|\,P^n_{XY}),$$

as well as the forward and reverse asymptotic BL exponents respectively as

$$\underline{\Lambda}^{(\infty)}_{p,q,\hat p,\hat q}(\alpha,\beta) := \lim_{n\to\infty}\underline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta),\qquad \overline{\Lambda}^{(\infty)}_{p,q,\hat p,\hat q}(\alpha,\beta) := \lim_{n\to\infty}\overline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta).$$

Obviously, $\underline{\Lambda}^{(1)}_{p,q,\hat p,\hat q}(\alpha,\beta)=\underline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta)$ and $\overline{\Lambda}^{(1)}_{p,q,\hat p,\hat q}(\alpha,\beta)=\overline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta)$. Given these definitions, two natural questions arise: Can we derive sharp dimension-free bounds (also termed single-letter bounds) for $\underline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta)$ and $\overline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta)$?
Does the tensorization property hold for $\underline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta)$ and $\overline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta)$ (i.e., $\underline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta)=\underline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta)$ and $\overline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta)=\overline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta)$ for any $n$)?

In this paper, we study these two questions. Specifically, by information-theoretic methods, we derive dimension-free bounds for $\underline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta)$ and $\overline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta)$, which are asymptotically sharp for finite $\mathcal{X},\mathcal{Y}$ as the dimension $n\to\infty$. We observe that these expressions differ from $\underline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta)$ and $\overline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta)$, which implies that the tensorization property does not hold in general.

The work in this paper is motivated by the works in [1], [2]. As special cases of BL inequalities, the forward hypercontractivity inequalities were strengthened in the same spirit in [1], [2]. Both works focused only on strengthening the single-function version of forward hypercontractivity. Polyanskiy and Samorodnitsky's inequalities in [1] are sharp only for extreme cases, while Kirshner and Samorodnitsky focused only on binary symmetric distributions in [2]. Furthermore, hypercontractivity inequalities for distributions on finite alphabets $\mathcal{X},\mathcal{Y}$ were recently studied by the present author, Anantharam, and Chen [3]. Our work here generalizes all these existing works to the BL inequalities for all $p,q,\hat p,\hat q\in\mathbb{R}\setminus\{0\}$ and for arbitrary distributions on Polish spaces. Moreover, our strong BL inequalities are asymptotically sharp at least for distributions on finite alphabets.

The forward version of BL inequalities was originally studied by Brascamp and Lieb in the 1970s [4], motivated by problems from particle physics; the reverse version was initially studied by Barthe in [5].
Hypercontractivity inequalities are an important class of BL inequalities, which were successively investigated in [6]–[13]. Information-theoretic formulations of BL inequalities or hypercontractivity inequalities can be found in [11], [14]–[18]. Euclidean versions of BL inequalities and their generalizations were also studied in [19], [20].

A. Main Contributions

In this paper, we study nonlinear Brascamp-Lieb inequalities, and derive sharp dimension-free versions of nonlinear Brascamp-Lieb inequalities (including hypercontractivity inequalities) for distributions on Polish spaces, which strengthen the classic Brascamp-Lieb inequalities. As applications of our nonlinear Brascamp-Lieb inequalities, we strengthen and generalize Mr. and Mrs. Gerber's lemmas [21], [22] to sharp Rényi entropy versions, and the small-set expansion theorems to strong versions for arbitrary distributions on Polish spaces. Also, by our nonlinear Brascamp-Lieb inequalities, we obtain sharp dimension-free bounds on the $q$-stability of Boolean functions. Our proofs in this paper are based on information-theoretic techniques and coupling techniques.

We believe that it is not difficult to generalize our results in this paper to the nonlinear version of forward-reverse Brascamp-Lieb inequalities [18] by using similar proof methods. We believe that the resulting inequalities will be involved, and hence we do not discuss this generalization in this work. However, it is more interesting to the present author to study strengthened versions of other functional inequalities in the future.

Throughout this paper, when we write a function $f:\mathcal{X}\to\mathbb{R}_{\ge 0}$, by default we mean that it is measurable. A similar convention also applies to a function $g:\mathcal{Y}\to\mathbb{R}_{\ge 0}$.

B. Preliminaries and Notations

We use $P_X, Q_X, R_X, S_X$ to denote probability measures on $(\mathcal{X},\mathbb{B}_{\mathcal{X}})$. For a joint probability measure $Q_{XY}$ on $(\mathcal{X}\times\mathcal{Y},\mathbb{B}_{\mathcal{X}}\otimes\mathbb{B}_{\mathcal{Y}})$, its marginal on $\mathcal{X}$ is denoted as $Q_X$, and the regular conditional probability measure (Markov kernel) on $(\mathcal{X},\mathbb{B}_{\mathcal{X}})$ given $(\mathcal{Y},\mathbb{B}_{\mathcal{Y}})$ is denoted as $Q_{X|Y}$, where $Q_{X|Y=y} := Q_{X|Y}(\cdot|y)$ is a probability measure on $(\mathcal{X},\mathbb{B}_{\mathcal{X}})$ for every $y\in\mathcal{Y}$. We denote $R_Y R_{X|Y}$ as the joint distribution induced by $R_Y$ and $R_{X|Y}$, and $R_Y\circ R_{X|Y}$ as the marginal distribution on $\mathcal{X}$ of the joint distribution $R_Y R_{X|Y}$. We use $Q_X\ll P_X$ to denote that the distribution $Q_X$ is absolutely continuous w.r.t. $P_X$. We use $X^n$ or $\mathbf{X}$ to denote a random vector $(X_1,X_2,\ldots,X_n)$ defined on $(\mathcal{X}^n,\mathbb{B}^{\otimes n}_{\mathcal{X}})$, and use $x^n$ or $\mathbf{x}$ to denote its realization. We use $Q^n_X$ to denote the product of $n$ copies of $Q_X$, and use $Q_{X^n}$ or $Q_{\mathbf{X}}$ to denote an arbitrary probability measure on $(\mathcal{X}^n,\mathbb{B}^{\otimes n}_{\mathcal{X}})$.

For a conditional probability measure $P_{X|Y}$, define the conditional expectation operator induced by $P_{X|Y}$ as $P_{X|Y}f(y) := \mathbb{E}[f(X)\,|\,Y=y]$ for any measurable function $f:(\mathcal{X},\mathbb{B}_{\mathcal{X}})\to(\mathbb{R},\mathbb{B}_{\mathbb{R}})$, where $X\sim P_{X|Y}(\cdot|y)$ given $Y=y$. We use $\mathcal{C}(Q_X,Q_Y)$ to denote the set of couplings (joint probability measures) $Q_{XY}$ with marginals $Q_X,Q_Y$, and $\mathcal{C}(Q_{X|UW},Q_{Y|VW})$ to denote the set of conditional couplings $Q_{XY|UVW}$ with conditional marginals $Q_{X|UW},Q_{Y|VW}$. Note that for any $Q_{XY|UVW}\in\mathcal{C}(Q_{X|UW},Q_{Y|VW})$, its marginals satisfy $Q_{X|UVW}=Q_{X|UW}$ and $Q_{Y|UVW}=Q_{Y|VW}$, i.e., under the conditional distribution $Q_{XY|UVW}$, $X\to(U,W)\to V$ and $Y\to(V,W)\to U$ form Markov chains. Here we say that $U\to W\to V$ forms a Markov chain if $U,V$ are conditionally independent given $W$.
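For finite alphabets, the objects just introduced reduce to elementary array operations. The sketch below is only an illustration under that finiteness assumption; the function names, the kernel layout `Rx_given_y[y][x]`, and the numbers are hypothetical and not taken from the paper.

```python
# Minimal finite-alphabet sketch (all names and numbers are illustrative):
# the joint R_Y R_{X|Y}, the composition R_Y o R_{X|Y}, the conditional
# expectation operator P_{X|Y} f, and a membership test for C(Q_X, Q_Y).

def joint(Ry, Rx_given_y):
    """Joint pmf R_Y R_{X|Y}, returned as J[x][y] = Ry[y] * Rx_given_y[y][x]."""
    nx, ny = len(Rx_given_y[0]), len(Ry)
    return [[Ry[y] * Rx_given_y[y][x] for y in range(ny)] for x in range(nx)]

def compose(Ry, Rx_given_y):
    """R_Y o R_{X|Y}: the X-marginal of the joint R_Y R_{X|Y}."""
    return [sum(row) for row in joint(Ry, Rx_given_y)]

def cond_expectation(Rx_given_y, f):
    """(P_{X|Y} f)(y) = E[f(X) | Y = y], one value per y."""
    return [sum(kernel[x] * f[x] for x in range(len(f))) for kernel in Rx_given_y]

def is_coupling(Qxy, Qx, Qy, tol=1e-12):
    """Check that Qxy[x][y] has marginals Qx and Qy."""
    rows_ok = all(abs(sum(Qxy[x]) - Qx[x]) <= tol for x in range(len(Qx)))
    cols_ok = all(abs(sum(Qxy[x][y] for x in range(len(Qx))) - Qy[y]) <= tol
                  for y in range(len(Qy)))
    return rows_ok and cols_ok

# Hypothetical example.
Ry = [0.25, 0.75]
Rx_given_y = [[0.5, 0.5], [0.2, 0.8]]   # row y is the pmf R_{X|Y=y}
f = [1.0, 3.0]

Rx = compose(Ry, Rx_given_y)
Ef_via_kernel = sum(Ry[y] * v for y, v in enumerate(cond_expectation(Rx_given_y, f)))
Ef_via_marginal = sum(Rx[x] * f[x] for x in range(len(f)))
```

The two expectations coincide by the tower property, and `joint(Ry, Rx_given_y)` is by construction an element of $\mathcal{C}(R_Y\circ R_{X|Y}, R_Y)$.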
For a sequence $x^n$, we use $T_{x^n}$ to denote the type (i.e., empirical distribution) of $x^n$.

For $s\in\mathbb{R}\setminus\{0,1\}$, we define

$$D_s(Q\|P) := \frac{1}{s-1}\log\int\Big(\frac{\mathrm{d}Q}{\mathrm{d}\mu}\Big)^{s}\Big(\frac{\mathrm{d}P}{\mathrm{d}\mu}\Big)^{1-s}\mathrm{d}\mu$$

for a nonnegative finite measure $\mu$ (in fact, the value of $D_s(Q\|P)$ is independent of the choice of $\mu$), where $\frac{\mathrm{d}P}{\mathrm{d}\mu}$ denotes the Radon–Nikodym derivative of $P$ w.r.t. $\mu$. We assume that all Radon–Nikodym derivatives are nonnegative real-valued measurable functions. Here and throughout this paper, we adopt the conventions that $\log 0=-\infty$ and $\log\infty=\infty$. For $s=1$, we define $D_1(Q\|P)=D(Q\|P) := \int\log\frac{\mathrm{d}Q}{\mathrm{d}P}\,\mathrm{d}Q$ if $Q\ll P$ and $\int\big|\log\frac{\mathrm{d}Q}{\mathrm{d}P}\big|\,\mathrm{d}Q<\infty$; otherwise, $D_1(Q\|P)=D(Q\|P) := \infty$. We say that $D_s(Q\|P)$ exists if $D_s(Q\|P)<\infty$. Recall that for a Lebesgue integral $\int f\,\mathrm{d}Q$, we say that this integral exists if $\int|f|\,\mathrm{d}Q<\infty$. For a function of distributions that involves Rényi divergences or Lebesgue integrals, we say that this function exists if all the Rényi divergences and Lebesgue integrals involved exist (and hence are finite). Throughout this paper, we use the following convention.

Convention 1.

When we write an optimization problem with distributions as the variables, we by default require that the distributions satisfy that all the constraint functions and the objective function exist (and hence are finite). If there is no such distribution, by default, the value of the optimization problem is set to $\infty$ if the optimization is an infimization, and $-\infty$ if the optimization is a supremization.

Obviously, we can write $\frac{f^p}{\|f\|_p^p}$ as a Radon–Nikodym derivative $\frac{\mathrm{d}Q_X}{\mathrm{d}P_X}$. Then $H_{p,\hat p}(f)=D_{\hat p/p}(Q_X\|P_X)$. Hence, by the properties of the Rényi divergence in [23], we know that $H_{p,\hat p}(f)\ge 0$ for $\frac{\hat p}{p}\ge 0$, and $H_{p,\hat p}(f)\le 0$ for $\frac{\hat p}{p}<0$. We refer to [23] for more properties of the Rényi divergence. Furthermore, it is easy to verify that $H_{p,\hat p}(f)=H_{\hat p,p}(f)=H_{p/s,\hat p/s}(f^s)=H_{\hat p/s,p/s}(f^s)$ for $s\neq 0$.

Throughout this paper, when we talk about distributions $Q_X,R_X,S_X$ and conditional distributions $Q_{X|W},R_{X|W},S_{X|W}$, we mean that $Q_X,R_X,S_X,Q_{X|W=w},R_{X|W=w},S_{X|W=w}\ll P_X$ for each $w$. Hence, for brevity, we omit these underlying conditions. The same convention also applies to $Q_Y,R_Y,S_Y,Q_{Y|W},R_{Y|W},S_{Y|W}$ ($\ll P_Y$).

We denote $[m:n] := \{m,m+1,\ldots,n\}$ and $[n] := [1:n]$. For an $n$-length vector or sequence $x^n$ and a subset $\mathcal{J}\subseteq[n]$, $x_{\mathcal{J}} := (x_j)_{j\in\mathcal{J}}$. Throughout this paper, we use the conventions $\inf\emptyset=\infty$ and $\sup\emptyset=-\infty$. We denote $x\vee y := \max\{x,y\}$ and $[x]^+ := x\vee 0$. Denote $q' := \frac{q}{q-1}$ as the Hölder conjugate of $q$.

II. STRONG BL AND HC INEQUALITIES: TWO-FUNCTION VERSION

A. Information-Theoretic Characterizations

We first provide information-theoretic characterizations of $\underline{\Lambda}_{p,q,\hat p,\hat q}$ and $\overline{\Lambda}_{p,q,\hat p,\hat q}$ in the following proposition. The proof is provided in Appendix A. Define

$$\phi(Q_X,Q_Y\,|\,P_{XY}) := \inf_{R_{XY}}\Big\{D(R_{XY}\|P_{XY})+\frac{1}{p}D(R_X\|Q_X)-\frac{1}{p}D(R_X\|P_X)+\frac{1}{q}D(R_Y\|Q_Y)-\frac{1}{q}D(R_Y\|P_Y)\Big\}, \tag{7}$$

where, according to Convention 1, the infimization is taken over all $R_{XY}$ such that all the relative entropies appearing in the objective function are finite.

Proposition 1 (Information-Theoretic Characterizations). For $p,q,\hat p,\hat q\in\mathbb{R}\setminus\{0\}$,

$$\underline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta)=\inf_{\substack{Q_X,Q_Y:\ D_{\hat p/p}(Q_X\|P_X)=\alpha,\\ D_{\hat q/q}(Q_Y\|P_Y)=\beta}}\phi(Q_X,Q_Y\,|\,P_{XY})+\frac{\alpha}{p}+\frac{\beta}{q},$$

$$\overline{\Lambda}_{p,q,\hat p,\hat q}(\alpha,\beta)=\sup_{\substack{Q_X,Q_Y:\ D_{\hat p/p}(Q_X\|P_X)=\alpha,\\ D_{\hat q/q}(Q_Y\|P_Y)=\beta}}\phi(Q_X,Q_Y\,|\,P_{XY})+\frac{\alpha}{p}+\frac{\beta}{q},$$

where in these two optimization problems, $Q_X\ll P_X$ if $p,\hat p>0$; otherwise, $Q_X\ll\gg P_X$, and similarly, $Q_Y\ll P_Y$ if $q,\hat q>0$; otherwise, $Q_Y\ll\gg P_Y$.

B. Brascamp-Lieb Exponents

Utilizing the information-theoretic expressions in Proposition 1, we next provide dimension-free bounds for the $n$-dimensional versions $\underline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}$ and $\overline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}$.

Define

$$\alpha_{\max} := \sup_{Q_X}D(Q_X\|P_X)=\sup_{A\in\mathbb{B}_{\mathcal{X}}:\ P_X(A)>0}-\log P_X(A)$$

if the rightmost expression is finite; otherwise (e.g., when $P_X$ is not purely atomic), define $\alpha_{\max} := \infty^-$. Define $\beta_{\max}$ for $P_Y$ similarly. For ease of presentation, here we introduce the notation $\infty^-$, and we will later use "$x\le\infty^-$" to denote "$x<\infty$", "$[0,\infty^-]$" to denote "$[0,\infty)$", and "$x\neq\infty^-$" to denote "$x\neq\infty$".

Let $(\mathcal{W},\mathbb{B}_{\mathcal{W}},Q_W)$ be an arbitrary probability measure space. Define the minimum conditional relative entropy over couplings of $(Q_{X|W},Q_{Y|W})$, with respect to $P_{XY}$ and conditionally on $Q_W$, as

$$D(Q_{X|W},Q_{Y|W}\|P_{XY}\,|\,Q_W) := \inf_{Q_{XY|W}\in\mathcal{C}(Q_{X|W},Q_{Y|W})}D(Q_{XY|W}\|P_{XY}\,|\,Q_W).$$

Define

$$\underline{\Lambda}^*_{p,q,\hat p,\hat q}(\alpha,\beta) := \inf_{Q_W,Q_{X|W},Q_{Y|W}}D(Q_{X|W},Q_{Y|W}\|P_{XY}\,|\,Q_W)+\eta_{p,\hat p}\big(\alpha,D(Q_{X|W}\|P_X\,|\,Q_W)\big)+\eta_{q,\hat q}\big(\beta,D(Q_{Y|W}\|P_Y\,|\,Q_W)\big),$$

where

$$\eta_{p,\hat p}(\alpha,s) := \begin{cases}\dfrac{\alpha-s}{p}\vee\dfrac{\alpha-s}{\hat p} & p,\hat p>0\\[2mm] \dfrac{\alpha-s}{p} & p>0>\hat p\\[2mm] \dfrac{\alpha-s}{\hat p} & p<0<\hat p\\[2mm] -\infty & p,\hat p<0.\end{cases}$$

Here the infimization is taken over all probability measure spaces $(\mathcal{W},\mathbb{B}_{\mathcal{W}},Q_W)$ and regular conditional probability measures $Q_{X|W},Q_{Y|W}$. However, by Carathéodory's theorem, without loss of optimality, it suffices to restrict $|\mathcal{W}|\le$ […].

According to the signs of $p,q,\hat p,\hat q$, we partition the distributions $\{Q_{X|W},Q_{Y|W},\hat Q_{X|W},\hat Q_{Y|W}\}$ into two subsets $\mathcal{Q}^+,\mathcal{Q}^-$. Specifically, $Q_{X|W}\in\mathcal{Q}^+$ if $p>0$; $Q_{X|W}\in\mathcal{Q}^-$ otherwise. Similarly, $Q_{Y|W},\hat Q_{X|W},\hat Q_{Y|W}$ are assigned to $\mathcal{Q}^+$ or $\mathcal{Q}^-$ according to the signs of $q,\hat p,\hat q$, respectively.
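The piecewise function $\eta_{p,\hat p}(\alpha,s)$ defined above can be transcribed directly. This is a minimal sketch; the function name `eta` and the sample arguments are illustrative assumptions, and the code only evaluates the four sign cases of the definition.

```python
import math

# Minimal sketch of eta_{p, p_hat}(alpha, s) from the definition above.
# The name `eta` and the sample values below are illustrative assumptions.

def eta(p, p_hat, alpha, s):
    """Evaluate eta_{p,p_hat}(alpha, s) according to the signs of p and p_hat."""
    if p == 0 or p_hat == 0:
        raise ValueError("eta is only defined for nonzero p and p_hat")
    if p > 0 and p_hat > 0:
        return max((alpha - s) / p, (alpha - s) / p_hat)
    if p > 0 > p_hat:
        return (alpha - s) / p
    if p < 0 < p_hat:
        return (alpha - s) / p_hat
    return -math.inf  # remaining case: p < 0 and p_hat < 0

# For p, p_hat > 0 the maximum switches branch with the sign of alpha - s.
value_pos = eta(2.0, 4.0, 1.0, 0.5)   # max(0.25, 0.125)
value_neg = eta(2.0, 4.0, 0.5, 1.0)   # max(-0.25, -0.125)
```

Note that `eta(p, p_hat, alpha, s)` equals `eta(p_hat, p, alpha, s)` in every case, in line with the symmetry w.r.t. $(p,\hat p)$ stated in Lemma 1.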
Define

$$\theta_{p,q}(Q_W,Q_{X|W},Q_{Y|W}) := D(Q_{X|W},Q_{Y|W}\|P_{XY}\,|\,Q_W)+\frac{\alpha-D(Q_{X|W}\|P_X\,|\,Q_W)}{p}+\frac{\beta-D(Q_{Y|W}\|P_Y\,|\,Q_W)}{q},$$

and

$$\alpha^-_\lambda := \begin{cases}\alpha & \lambda\le 1\\ 0 & \text{otherwise},\end{cases}\qquad \alpha^+_\lambda := \begin{cases}\alpha & \lambda\ge 1\\ \alpha_{\max} & \text{otherwise},\end{cases}$$

$$\beta^-_\mu := \begin{cases}\beta & \mu\le 1\\ 0 & \text{otherwise},\end{cases}\qquad \beta^+_\mu := \begin{cases}\beta & \mu\ge 1\\ \beta_{\max} & \text{otherwise}.\end{cases}$$

Define

$$\overline{\Lambda}^*_{p,q,\hat p,\hat q}(\alpha,\beta) := \sup_{Q_W,\mathcal{Q}^+}\ \inf_{\mathcal{Q}^-}\ \min\Big\{\theta_{p,q}(Q_W,Q_{X|W},Q_{Y|W}),\ \theta_{p,\hat q}(Q_W,Q_{X|W},\hat Q_{Y|W}),\ \theta_{\hat p,q}(Q_W,\hat Q_{X|W},Q_{Y|W}),\ \theta_{\hat p,\hat q}(Q_W,\hat Q_{X|W},\hat Q_{Y|W})\Big\},$$

where the infimization is taken over all the distributions in $\mathcal{Q}^-$, and the supremization is taken over all probability measure spaces $(\mathcal{W},\mathbb{B}_{\mathcal{W}},Q_W)$ and all the distributions in $\mathcal{Q}^+$ under the constraints $\alpha^-_{\hat p/p}\le D(Q_{X|W}\|P_X\,|\,Q_W)\le\alpha^+_{\hat p/p}$ and $\alpha^-_{p/\hat p}\le D(\hat Q_{X|W}\|P_X\,|\,Q_W)\le\alpha^+_{p/\hat p}$ if $p,\hat p>0$, as well as $\beta^-_{\hat q/q}\le D(Q_{Y|W}\|P_Y\,|\,Q_W)\le\beta^+_{\hat q/q}$ and $\beta^-_{q/\hat q}\le D(\hat Q_{Y|W}\|P_Y\,|\,Q_W)\le\beta^+_{q/\hat q}$ if $q,\hat q>0$. Similarly to the case of $\underline{\Lambda}^*_{p,q,\hat p,\hat q}$, by Carathéodory's theorem, without loss of optimality, it suffices to restrict $|\mathcal{W}|\le$ […].

We provide dimension-free bounds for $\underline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}$ and $\overline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}$ in the following theorem. The proof is given in Appendix B.

Theorem 1 (Brascamp-Lieb Exponents).
For any $n\ge 1$, $p,q,\hat p,\hat q\in\mathbb{R}\setminus\{0\}$,

$$\alpha\in\begin{cases}[0,\alpha_{\max}] & p\hat p>0\\ (-\infty,0] & p\hat p<0,\end{cases}\qquad\text{and}\qquad \beta\in\begin{cases}[0,\beta_{\max}] & q\hat q>0\\ (-\infty,0] & q\hat q<0,\end{cases}$$

we have

$$\underline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta)\ge\underline{\Lambda}^*_{p,q,\hat p,\hat q}(\alpha,\beta) \tag{8}$$

$$\overline{\Lambda}^{(n)}_{p,q,\hat p,\hat q}(\alpha,\beta)\le\overline{\Lambda}^*_{p,q,\hat p,\hat q}(\alpha,\beta). \tag{9}$$

Moreover, for finite alphabets $\mathcal{X},\mathcal{Y}$, these two inequalities are asymptotically sharp as $n\to\infty$ for given $(p,q,\hat p,\hat q,\alpha,\beta)$ with $\alpha\neq 0,\alpha_{\max}$ and $\beta\neq 0,\beta_{\max}$, i.e.,

$$\underline{\Lambda}^{(\infty)}_{p,q,\hat p,\hat q}(\alpha,\beta)=\underline{\Lambda}^*_{p,q,\hat p,\hat q}(\alpha,\beta),\qquad \overline{\Lambda}^{(\infty)}_{p,q,\hat p,\hat q}(\alpha,\beta)=\overline{\Lambda}^*_{p,q,\hat p,\hat q}(\alpha,\beta).$$

Remark 1. Note that $\underline{\Lambda}^*_{p,q,\hat p,\hat q}(\alpha,\beta)$ (resp. $\overline{\Lambda}^*_{p,q,\hat p,\hat q}(\alpha,\beta)$) is not necessarily always positive (or always negative).

C. Strong Brascamp-Lieb Inequalities

Let $(\mathbf{X},\mathbf{Y})\sim P^n_{XY}$. Theorem 1 can be rewritten in the following form.

Corollary 1 (Strong Brascamp-Lieb Inequalities (Two-Function Version)). For $p,q,\hat p,\hat q\in\mathbb{R}\setminus\{0\}$,

$$\mathbb{E}[f(\mathbf{X})g(\mathbf{Y})]\le e^{-n\left(\underline{\Lambda}^*_{p,q,\hat p,\hat q}(\alpha,\beta)-\frac{\alpha}{p}-\frac{\beta}{q}\right)}\,\|f(\mathbf{X})\|_p\,\|g(\mathbf{Y})\|_q \tag{10}$$

$$\mathbb{E}[f(\mathbf{X})g(\mathbf{Y})]\ge e^{-n\left(\overline{\Lambda}^*_{p,q,\hat p,\hat q}(\alpha,\beta)-\frac{\alpha}{p}-\frac{\beta}{q}\right)}\,\|f(\mathbf{X})\|_p\,\|g(\mathbf{Y})\|_q, \tag{11}$$

for all nonnegative functions $f,g$, where $\alpha=\frac{1}{n}H_{p,\hat p}(f)$ and $\beta=\frac{1}{n}H_{q,\hat q}(g)$.

Remark 2. By symmetry, the inequalities (10) and (11) still hold if $p,\hat p$ (or $q,\hat q$) are swapped.

The expressions of $\underline{\Lambda}^*_{p,q,\hat p,\hat q}$ and $\overline{\Lambda}^*_{p,q,\hat p,\hat q}$ are somewhat complicated. We next simplify the inequalities (10) and (11) by choosing $(\hat p,\hat q)$ as some specific function of $(\alpha,\beta)$.

Define

$$\underline{\Theta}(\alpha,\beta) := \inf_{\substack{Q_{XYW}:\ D(Q_{X|W}\|P_X|Q_W)\ge\alpha,\\ D(Q_{Y|W}\|P_Y|Q_W)\ge\beta}}D(Q_{XY|W}\|P_{XY}\,|\,Q_W), \tag{12}$$

$$\overline{\Theta}(\alpha,\beta) := \sup_{\substack{Q_W,Q_{X|W},Q_{Y|W}:\ D(Q_{X|W}\|P_X|Q_W)\le\alpha,\\ D(Q_{Y|W}\|P_Y|Q_W)\le\beta}}\ \inf_{Q_{XY|W}\in\mathcal{C}(Q_{X|W},Q_{Y|W})}D(Q_{XY|W}\|P_{XY}\,|\,Q_W), \tag{13}$$

and for $q<0$,

$$\overline{\Theta}_q(\alpha) := \sup_{Q_W,Q_{X|W}:\ D(Q_{X|W}\|P_X|Q_W)\le\alpha}\ \inf_{Q_{Y|W}}\ D(Q_{X|W},Q_{Y|W}\|P_{XY}\,|\,Q_W)-\frac{D(Q_{Y|W}\|P_Y\,|\,Q_W)}{q}. \tag{14}$$

Without loss of optimality, the alphabet size of $W$ in (12) and (13) can be assumed to be no larger than […], and the alphabet size of $W$ in (14) can be assumed to be no larger than […].

Define the minimum-relative-entropy region of $P_{XY}$ as

$$\mathcal{D}(P_{XY}) := \bigcup_{Q_X,Q_Y}\big\{\big(D(Q_X\|P_X),\ D(Q_Y\|P_Y),\ D(Q_X,Q_Y\|P_{XY})\big)\big\}.$$
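For finite alphabets, each point of the minimum-relative-entropy region can be approximated numerically. The sketch below is an illustration and not the paper's method: it assumes strictly positive `Pxy`, `Qx`, `Qy` and uses iterative proportional fitting (alternately rescaling rows and columns of $P_{XY}$ to match the target marginals), a standard procedure for approximating the minimizer of $D(Q_{XY}\|P_{XY})$ over the couplings $\mathcal{C}(Q_X,Q_Y)$. All function names and numbers are hypothetical.

```python
import math

# Minimal sketch (assumes strictly positive Pxy, Qx, Qy on finite alphabets):
# approximate D(Q_X, Q_Y || P_XY) = min over couplings of D(Q_XY || P_XY)
# by iterative proportional fitting. Names and numbers are illustrative.

def kl(Q, P):
    """Relative entropy D(Q||P) in nats for two pmfs given as lists."""
    return sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)

def fit_coupling(Qx, Qy, Pxy, iters=500):
    """Alternately rescale rows and columns of Pxy to match Qx and Qy."""
    nx, ny = len(Qx), len(Qy)
    R = [row[:] for row in Pxy]
    for _ in range(iters):
        for x in range(nx):                       # match the X-marginal
            row_sum = sum(R[x])
            R[x] = [R[x][y] * Qx[x] / row_sum for y in range(ny)]
        for y in range(ny):                       # match the Y-marginal
            col_sum = sum(R[x][y] for x in range(nx))
            for x in range(nx):
                R[x][y] *= Qy[y] / col_sum
    return R

def coupling_divergence(Qx, Qy, Pxy):
    """Approximate D(Q_X, Q_Y || P_XY) using the fitted coupling."""
    R = fit_coupling(Qx, Qy, Pxy)
    flat_R = [v for row in R for v in row]
    flat_P = [v for row in Pxy for v in row]
    return kl(flat_R, flat_P)

# One point of the region for a hypothetical binary P_XY.
Pxy = [[0.45, 0.05], [0.05, 0.45]]
Px, Py = [0.5, 0.5], [0.5, 0.5]
Qx, Qy = [0.8, 0.2], [0.3, 0.7]
point = (kl(Qx, Px), kl(Qy, Py), coupling_divergence(Qx, Qy, Pxy))
```

By the data processing inequality, the third coordinate of `point` is at least as large as each of the first two, and it vanishes when $Q_X=P_X$ and $Q_Y=P_Y$.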
Obviously, $\mathcal{D}(P_{XY})$ is determined by the following quantities:

$$\varphi(s,t) := \inf_{\substack{Q_{XY}:\ D(Q_X\|P_X)=s,\\ D(Q_Y\|P_Y)=t}}D(Q_{XY}\|P_{XY})=\inf_{\substack{Q_X,Q_Y:\ D(Q_X\|P_X)=s,\\ D(Q_Y\|P_Y)=t}}D(Q_X,Q_Y\|P_{XY}),$$

$$\psi(s,t) := \sup_{\substack{Q_X,Q_Y:\ D(Q_X\|P_X)=s,\\ D(Q_Y\|P_Y)=t}}D(Q_X,Q_Y\|P_{XY}),$$

and

$$\psi_q(s) := \sup_{Q_X:\ D(Q_X\|P_X)=s}\ \inf_{Q_Y}\ D(Q_X,Q_Y\|P_{XY})-\frac{D(Q_Y\|P_Y)}{q}. \tag{15}$$

Define $\breve{\varphi}(s,t)$ as the lower convex envelope of $\varphi(s,t)$, and $\overset{\frown}{\psi}(s,t)$, $\overset{\frown}{\psi}_q(s)$ respectively as the upper concave envelopes of $\psi(s,t)$, $\psi_q(s)$. Then

$$\underline{\Theta}(\alpha,\beta)=\inf_{s\ge\alpha,\,t\ge\beta}\breve{\varphi}(s,t) \tag{16}$$

$$\overline{\Theta}(\alpha,\beta)=\sup_{s\le\alpha,\,t\le\beta}\overset{\frown}{\psi}(s,t) \tag{17}$$

$$\overline{\Theta}_q(\alpha)=\sup_{s\le\alpha}\overset{\frown}{\psi}_q(s). \tag{18}$$

Hence, $\underline{\Theta}(\alpha,\beta)$ is convex in $(\alpha,\beta)$, $\overline{\Theta}(\alpha,\beta)$ is concave in $(\alpha,\beta)$, and $\overline{\Theta}_q(\alpha)$ is concave in $\alpha$ for each $q$.

Define various conditions on $(p,q,p^*,q^*,\alpha,\beta,f,g)$ as follows.

1) Condition 0+: $\underline{\Theta}(\alpha,\beta)+\frac{s-\alpha}{p^*}+\frac{t-\beta}{q^*}\le\underline{\Theta}(s,t)$ for all $s,t\ge 0$.

2) Condition 0-: $\overline{\Theta}(\alpha,\beta)+\frac{s-\alpha}{p^*}+\frac{t-\beta}{q^*}\ge\overline{\Theta}(s,t)$ for all $s,t\ge 0$.

3) Condition 0--: $\overline{\Theta}_q(\alpha)+\frac{s-\alpha}{p^*}\ge\overline{\Theta}_q(s)$ for all $s\ge 0$.

4) Condition 1+: $H_{p,p^*}(f)\ge\alpha$ if $p\ge p^*$; $H_{p,p^*}(f)\le\alpha$ if $0\le p\le p^*$. Similarly, $H_{q,q^*}(g)\ge\beta$ if $q\ge q^*$; $H_{q,q^*}(g)\le\beta$ if $0\le q\le q^*$.

5) Condition 1-: $H_{p,p^*}(f)\le\alpha$ if $p\ge p^*$; $H_{p,p^*}(f)\ge\alpha$ if $0\le p\le p^*$. Similarly, $H_{q,q^*}(g)\le\beta$ if $q\ge q^*$; $H_{q,q^*}(g)\ge\beta$ if $0\le q\le q^*$.

6) Condition 1--: $H_{p,p^*}(f)\le\alpha$ if $p\ge p^*$; $H_{p,p^*}(f)\ge\alpha$ if $0\le p\le p^*$.

In Conditions 0+ and 0-, if $\underline{\Theta}(s,t)$ and $\overline{\Theta}(s,t)$ are differentiable at the point $(\alpha,\beta)$, then $(s,t)\mapsto\underline{\Theta}(\alpha,\beta)+\frac{s-\alpha}{p^*}+\frac{t-\beta}{q^*}$ and $(s,t)\mapsto\overline{\Theta}(\alpha,\beta)+\frac{s-\alpha}{p^*}+\frac{t-\beta}{q^*}$ respectively correspond to the tangent planes of $\underline{\Theta}(s,t)$ and $\overline{\Theta}(s,t)$ at the point $(\alpha,\beta)$. Similarly, $s\mapsto\overline{\Theta}_q(\alpha)+\frac{s-\alpha}{p^*}$ corresponds to the tangent line of $\overline{\Theta}_q(s)$ at the point $s=\alpha$, if $\overline{\Theta}_q(s)$ is differentiable at $s=\alpha$.

By Corollary 1, we obtain the following simple version of the strong (forward and reverse) Brascamp-Lieb inequalities.

Theorem 2 (Strong Brascamp-Lieb Inequalities (Two-Function Version)). Let $\alpha\in[0,\alpha_{\max}]$, $\beta\in[0,\beta_{\max}]$. Then the following hold.

1) Let $p^*,q^*>0$ satisfy Condition 0+. Let $p,q>0$. Then
$$\mathbb{E}[f(\mathbf{X})g(\mathbf{Y})]\le e^{-n\left(\underline{\Theta}(\alpha,\beta)-\frac{\alpha}{p}-\frac{\beta}{q}\right)}\,\|f(\mathbf{X})\|_p\,\|g(\mathbf{Y})\|_q \tag{19}$$
for all nonnegative $(f,g)$ satisfying Condition 1+.

2) Let $p^*,q^*>0$ satisfy Condition 0-. Let $p,q>0$. Then
$$\mathbb{E}[f(\mathbf{X})g(\mathbf{Y})]\ge e^{-n\left(\overline{\Theta}(\alpha,\beta)-\frac{\alpha}{p}-\frac{\beta}{q}\right)}\,\|f(\mathbf{X})\|_p\,\|g(\mathbf{Y})\|_q \tag{20}$$
for all nonnegative $(f,g)$ satisfying Condition 1-.

3) Let $p^*>0>q$ satisfy Condition 0--. Let $p>0$. Then
$$\mathbb{E}[f(\mathbf{X})g(\mathbf{Y})]\ge e^{-n\left(\overline{\Theta}_q(\alpha)-\frac{\alpha}{p}\right)}\,\|f(\mathbf{X})\|_p\,\|g(\mathbf{Y})\|_q \tag{21}$$
for all nonnegative $f$ satisfying Condition 1-- and all nonnegative $g$.

Remark 3. We now discuss other cases that are not considered in the theorem above. For the forward case, for $p\le 0$ or $q\le 0$, we do not know how to prove a nontrivial upper bound on $\mathbb{E}[f(\mathbf{X})g(\mathbf{Y})]$, i.e., currently we only have $\mathbb{E}[f(\mathbf{X})g(\mathbf{Y})]\le\infty$. For the reverse case, for $p,q\le 0$, by the reverse Hölder inequality, $\mathbb{E}[f(\mathbf{X})g(\mathbf{Y})]\ge\|f(\mathbf{X})\|_p\|g(\mathbf{Y})\|_q$ holds for all nonnegative $(f,g)$.

Remark 4. Note that in Condition 0+, $p^*,q^*\ge 1$. Hence for $p\ge p^*\ge 1$, $H_{1/p}(f^p)=H_p(f)=H_{p,1}(f)\le H_{p,p^*}(f)\le H_{p,p}(f)=H(f^p)$. Similarly, in Condition 0-, $0<p^*,q^*\le 1$. Hence for $0<p\le p^*\le 1$, $H_{1/p}(f^p)=H_p(f)=H_{p,1}(f)\ge H_{p,p^*}(f)\ge H_{p,p}(f)=H(f^p)$.

Remark 5. Note that $\underline{\Theta}(\alpha,\beta)-\frac{\alpha}{p}-\frac{\beta}{q}$, $\overline{\Theta}(\alpha,\beta)-\frac{\alpha}{p}-\frac{\beta}{q}$, and $\overline{\Theta}_q(\alpha)-\frac{\alpha}{p}$ are not necessarily always positive (or always negative).

Proof:

Observe that

$$\begin{aligned}\underline{\Lambda}^*_{p,q,p^*,q^*}(a,b)&=\inf_{Q_W,Q_{X|W},Q_{Y|W}}D(Q_{X|W},Q_{Y|W}\|P_{XY}\,|\,Q_W)+\eta_{p,p^*}\big(a,D(Q_{X|W}\|P_X\,|\,Q_W)\big)+\eta_{q,q^*}\big(b,D(Q_{Y|W}\|P_Y\,|\,Q_W)\big)\\ &=\inf_{s,t\ge 0}\ \inf_{\substack{Q_W,Q_{X|W},Q_{Y|W}:\ D(Q_{X|W}\|P_X|Q_W)=s,\\ D(Q_{Y|W}\|P_Y|Q_W)=t}}D(Q_{X|W},Q_{Y|W}\|P_{XY}\,|\,Q_W)+\eta_{p,p^*}(a,s)+\eta_{q,q^*}(b,t)\\ &\ge\inf_{s,t\ge 0}\underline{\Theta}(s,t)+\eta_{p,p^*}(a,s)+\eta_{q,q^*}(b,t).\end{aligned}$$

For nonnegative $(f,g)$ satisfying Condition 1+, denote $a=H_{p,p^*}(f)$ and $b=H_{q,q^*}(g)$. Then we have

$$\begin{aligned}\underline{\Lambda}^*_{p,q,p^*,q^*}(a,b)-\frac{a}{p}-\frac{b}{q}&\ge\inf_{s,t\ge 0}\underline{\Theta}(s,t)+\Big[-\frac{s}{p}\vee\Big(\Big(\frac{1}{p^*}-\frac{1}{p}\Big)a-\frac{s}{p^*}\Big)\Big]+\Big[-\frac{t}{q}\vee\Big(\Big(\frac{1}{q^*}-\frac{1}{q}\Big)b-\frac{t}{q^*}\Big)\Big]\\ &\ge\inf_{s,t\ge 0}\underline{\Theta}(s,t)+\Big[-\frac{s}{p}\vee\Big(\Big(\frac{1}{p^*}-\frac{1}{p}\Big)\alpha-\frac{s}{p^*}\Big)\Big]+\Big[-\frac{t}{q}\vee\Big(\Big(\frac{1}{q^*}-\frac{1}{q}\Big)\beta-\frac{t}{q^*}\Big)\Big]\\ &=\inf_{s,t\ge 0}\underline{\Theta}(s,t)+\eta_{p,p^*}(\alpha,s)+\eta_{q,q^*}(\beta,t)-\frac{\alpha}{p}-\frac{\beta}{q}\\ &\ge\inf_{s,t\ge 0}\underline{\Theta}(s,t)+\frac{\alpha-s}{p^*}+\frac{\beta-t}{q^*}-\frac{\alpha}{p}-\frac{\beta}{q}\\ &=\underline{\Theta}(\alpha,\beta)-\frac{\alpha}{p}-\frac{\beta}{q}.\end{aligned}$$

Here the second inequality uses Condition 1+ (e.g., $a\ge\alpha$ and $\frac{1}{p^*}-\frac{1}{p}\ge 0$ when $p\ge p^*$), and the last equality uses Condition 0+. Substituting this into (10) yields (19).

We next prove (20). Since $p,q,p^*,q^*>0$, we have that

$$\begin{aligned}\overline{\Lambda}^*_{p,q,p^*,q^*}(a,b)&=\sup_{Q_W,\mathcal{Q}^+}\min\Big\{\theta_{p,q}(Q_W,Q_{X|W},Q_{Y|W}),\ \theta_{p,q^*}(Q_W,Q_{X|W},\hat Q_{Y|W}),\ \theta_{p^*,q}(Q_W,\hat Q_{X|W},Q_{Y|W}),\ \theta_{p^*,q^*}(Q_W,\hat Q_{X|W},\hat Q_{Y|W})\Big\}\\ &\le\sup_{\substack{Q_W,\hat Q_{X|W},\hat Q_{Y|W}:\ a^-_{p/p^*}\le D(\hat Q_{X|W}\|P_X|Q_W)\le a^+_{p/p^*},\\ b^-_{q/q^*}\le D(\hat Q_{Y|W}\|P_Y|Q_W)\le b^+_{q/q^*}}}\theta_{p^*,q^*}(Q_W,\hat Q_{X|W},\hat Q_{Y|W})\\ &\le\sup_{\substack{a^-_{p/p^*}\le\hat s\le a^+_{p/p^*},\\ b^-_{q/q^*}\le\hat t\le b^+_{q/q^*}}}\overline{\Theta}(\hat s,\hat t)+\frac{a-\hat s}{p^*}+\frac{b-\hat t}{q^*},\end{aligned}\tag{22}$$

where $a^\pm_\lambda$ and $b^\pm_\mu$ are defined as $\alpha^\pm_\lambda$ and $\beta^\pm_\mu$ with $(a,b)$ in place of $(\alpha,\beta)$, since the exponent is evaluated at $(a,b)$. Therefore, for the case $p\ge p^*$, $q\ge q^*$ (and hence by assumption, $0\le a\le\alpha$, $0\le b\le\beta$), we further have

$$\overline{\Lambda}^*_{p,q,p^*,q^*}(a,b)\le\sup_{\hat s\le a,\,\hat t\le b}\overline{\Theta}(\hat s,\hat t)+\frac{a-\hat s}{p^*}+\frac{b-\hat t}{q^*}=\overline{\Theta}(a,b),$$

and hence,

$$\overline{\Lambda}^*_{p,q,p^*,q^*}(a,b)-\frac{a}{p}-\frac{b}{q}\le\overline{\Theta}(a,b)-\frac{a}{p}-\frac{b}{q}\le\overline{\Theta}(\alpha,\beta)-\frac{\alpha}{p}-\frac{\beta}{q}, \tag{23}$$

where (23) follows by the concavity of $\overline{\Theta}$ and the assumptions $p\ge p^*$, $q\ge q^*$. Substituting (23) into (11) yields (20). The cases $p\ge p^*,q\le q^*$; $p\le p^*,q\ge q^*$; and $p\le p^*,q\le q^*$ can be proven similarly.

We lastly prove (21). For $p\ge p^*$ (and hence by assumption, $a\le\alpha$), we have

$$\begin{aligned}\overline{\Lambda}^*_{p,q,p^*,q^*}(a,b)&\le\sup_{\hat s\le a}\ \sup_{Q_W,Q_{X|W}:\ D(Q_{X|W}\|P_X|Q_W)=\hat s}\ \inf_{Q_{Y|W}}\ D(Q_{X|W},Q_{Y|W}\|P_{XY}\,|\,Q_W)+\frac{b-D(Q_{Y|W}\|P_Y\,|\,Q_W)}{q}+\frac{a-\hat s}{p^*}\\ &\le\sup_{\hat s\le a}\overline{\Theta}_q(\hat s)+\frac{a-\hat s}{p^*}+\frac{b}{q}\\ &=\overline{\Theta}_q(a)+\frac{b}{q}.\end{aligned}$$

Hence,

$$\overline{\Lambda}^*_{p,q,p^*,q^*}(a,b)-\frac{a}{p}-\frac{b}{q}\le\overline{\Theta}_q(a)-\frac{a}{p}\le\overline{\Theta}_q(\alpha)-\frac{\alpha}{p}, \tag{24}$$

where (24) follows by the concavity of $\overline{\Theta}_q$ and the assumptions $p\ge p^*$, $q\ge q^*$. Substituting this into (11) yields (21). The case $p\le p^*$ can be proven similarly.

It is well known that $H_{p,p^*}(f)\ge H_0(f)=-\log P^n_X(\operatorname{supp}(f))$. Hence $P^n_X(\operatorname{supp}(f))\le e^{-n\alpha}$ implies $H_{p,p^*}(f)\ge\alpha$ for any $p,p^*\ge 0$.

Corollary 2.

Let $\alpha\in[0,\alpha_{\max}]$, $\beta\in[0,\beta_{\max}]$. Let $f,g$ be nonnegative functions such that $P^n_X(\operatorname{supp}(f))\le e^{-n\alpha}$ and $P^n_Y(\operatorname{supp}(g))\le e^{-n\beta}$. Then the following hold.

1) Let $p^*,q^*>0$ satisfy Condition 0+. Then for any $p\ge p^*$, $q\ge q^*$,
$$\mathbb{E}[f(\mathbf{X})g(\mathbf{Y})]\le e^{-n\left(\underline{\Theta}(\alpha,\beta)-\frac{\alpha}{p}-\frac{\beta}{q}\right)}\,\|f(\mathbf{X})\|_p\,\|g(\mathbf{Y})\|_q. \tag{25}$$

2) Let $p^*,q^*>0$ satisfy Condition 0-. Then for any $0<p\le p^*$, $0<q\le q^*$,
$$\mathbb{E}[f(\mathbf{X})g(\mathbf{Y})]\ge e^{-n\left(\overline{\Theta}(\alpha,\beta)-\frac{\alpha}{p}-\frac{\beta}{q}\right)}\,\|f(\mathbf{X})\|_p\,\|g(\mathbf{Y})\|_q. \tag{26}$$

3) Let $p^*>0>q$ satisfy Condition 0--. Then for any $0<p\le p^*$,
$$\mathbb{E}[f(\mathbf{X})g(\mathbf{Y})]\ge e^{-n\left(\overline{\Theta}_q(\alpha)-\frac{\alpha}{p}\right)}\,\|f(\mathbf{X})\|_p\,\|g(\mathbf{Y})\|_q. \tag{27}$$

The inequalities (25) and (26) were first derived in [3] for the finite alphabet case.

D. Strong Hypercontractivity Inequalities

We next derive a strong version of hypercontractivity inequalities, i.e., a special class of BL inequalities with the factors equal to $1$. To this end, it suffices to find conditions under which $\underline{\Theta}(\alpha,\beta) - \frac{\alpha}{p} - \frac{\beta}{q} \ge 0$ holds for (19), $\overline{\Theta}(\alpha,\beta) - \frac{\alpha}{p} - \frac{\beta}{q} \le 0$ holds for (20), and $\overline{\Theta}_q(\alpha) - \frac{\alpha}{p} \le 0$ holds for (21).

For $\alpha \in [0,\alpha_{\max}]$, $\beta \in [0,\beta_{\max}]$, we define the (forward and reverse) hypercontractivity regions as
\[
\widetilde{\mathcal{R}}^{(\alpha,\beta)}_{\mathrm{FH}}(P_{XY}) := \left\{ (p,q) \in [1,\infty)^2 : \underline{\Theta}(s,t) \ge \frac{s}{p} + \frac{t}{q},\ \forall s \ge \alpha,\ t \ge \beta \right\} \tag{28}
\]
\[
\widetilde{\mathcal{R}}^{(\alpha,\beta)}_{\mathrm{RH}}(P_{XY}) := \left\{ (p,q) \in (0,1]^2 : \overline{\Theta}(s,t) \le \frac{s}{p} + \frac{t}{q},\ \forall s \ge \alpha,\ t \ge \beta \right\} \tag{29}
\]
\[
\widetilde{\mathcal{R}}^{(\alpha)}_{\mathrm{RH},q}(P_{XY}) := \left\{ p \in (0,1] : \overline{\Theta}_q(s) \le \frac{s}{p},\ \forall s \ge \alpha \right\}. \tag{30}
\]
When specialized to the case of $\alpha = \beta = 0$, $\widetilde{\mathcal{R}}^{(\alpha,\beta)}_{\mathrm{FH}}(P_{XY})$ and $\widetilde{\mathcal{R}}^{(\alpha,\beta)}_{\mathrm{RH}}(P_{XY})$ reduce to the usual hypercontractivity regions [15], [17], [18], [24, (6.117)]. (Rigorously speaking, only hypercontractivity ribbons, rather than hypercontractivity regions, were defined in [17], [24, (6.117)]. However, they are defined similarly, but with $p$ or $q$ replaced by its Hölder conjugate and with the region of $(p,q)$ on which Hölder's inequality or its reverse version holds removed.)

Following similar steps to the proof of Theorem 2, it is easy to obtain the following new version of hypercontractivity.

Corollary 3 (Strong Hypercontractivity Inequalities). Let $\alpha \in [0,\alpha_{\max}]$, $\beta \in [0,\beta_{\max}]$. Then the following hold.

1) Let $p^*, q^* > 0$ satisfy Condition 0+. Let $(p,q) \in \widetilde{\mathcal{R}}^{(\alpha,\beta)}_{\mathrm{FH}}(P_{XY})$. Then
\[
\mathbb{E}[f(\mathbf{X}) g(\mathbf{Y})] \le \|f(\mathbf{X})\|_p \|g(\mathbf{Y})\|_q \tag{31}
\]
for all nonnegative $(f,g)$ satisfying $H_{p,p^*}(f) \ge \alpha$ and $H_{q,q^*}(g) \ge \beta$.

2) Let $p^*, q^* > 0$ satisfy Condition 0-. Let $(p,q) \in \widetilde{\mathcal{R}}^{(\alpha,\beta)}_{\mathrm{RH}}(P_{XY})$. Then
\[
\mathbb{E}[f(\mathbf{X}) g(\mathbf{Y})] \ge \|f(\mathbf{X})\|_p \|g(\mathbf{Y})\|_q \tag{32}
\]
for all nonnegative $(f,g)$ satisfying $H_{p,p^*}(f) \ge \alpha$ and $H_{q,q^*}(g) \ge \beta$.

3) Let $p^* > 0 > q$ satisfy Condition 0--. Let $p \in \widetilde{\mathcal{R}}^{(\alpha)}_{\mathrm{RH},q}(P_{XY})$. Then (32) still holds for all nonnegative $f$ satisfying $H_{p,p^*}(f) \ge \alpha$ and all nonnegative $g$.

Remark. For finite $\mathcal{X}, \mathcal{Y}$, given $\alpha, \beta > 0$, the inequality (31) is asymptotically sharp, in the sense that for any $(p,q) \in [1,\infty)^2 \setminus \widetilde{\mathcal{R}}^{(\alpha,\beta)}_{\mathrm{FH}}(P_{XY})$, there exist a positive integer $n$ and a pair of functions $(f,g)$ on $\mathcal{X}^n, \mathcal{Y}^n$ respectively that satisfy the assumption in Theorem 1 but violate (31).
Given $\alpha, \beta > 0$, the inequality (32) is asymptotically sharp in a similar sense.

Conceptually, the hypercontractivity regions in (28)-(30) can be seen as sets of $(p,q)$ on which the hypercontractivity inequalities (or the reverse versions) hold in the large-deviation regime, while the usual hypercontractivity regions correspond to the moderate-deviation regime. In large deviation theory, it is well known that the moderate-deviation rate function can be recovered from the large-deviation rate function by letting the threshold parameter go to zero at a certain speed [25]. Our hypercontractivity inequalities here are stronger than the usual ones in a similar sense, if $P_{XY}$ satisfies that $\underline{\Theta}(s,t) \ge \frac{s}{p} + \frac{t}{q}$, $\overline{\Theta}(s,t) \le \frac{s}{p} + \frac{t}{q}$, and $\overline{\Theta}_q(s) \le \frac{s}{p}$ hold for all $s, t \ge 0$ if and only if they hold for any neighborhood of the origin.

III. STRONG BL AND HC INEQUALITIES: SINGLE-FUNCTION VERSION

We next consider the single-function version of BL exponents. We still let $(\mathbf{X}, \mathbf{Y}) \sim P_{XY}^n$. Given $(p, q, \hat{p}) \in \mathbb{R}^3$ and $P_{XY}$, we define the single-function version of optimal forward and reverse BL exponents as
\[
\underline{\Upsilon}_{p,q,\hat{p}}(\alpha | P_{XY}) := -\sup_{f \in \mathcal{F}_\alpha} \log \frac{\| \mathbb{E}[f(X) | Y] \|_q}{\| f(X) \|_{\hat{p}}} \tag{33}
\]
\[
\overline{\Upsilon}_{p,q,\hat{p}}(\alpha | P_{XY}) := -\inf_{f \in \mathcal{F}_\alpha} \log \frac{\| \mathbb{E}[f(X) | Y] \|_q}{\| f(X) \|_{\hat{p}}}. \tag{34}
\]
Similar to Lemma 1, we also have the following lemma.

Lemma 2.

Both Γ p,q, (cid:98) p ( α | P XY ) := Υ p,q, (cid:98) p ( α | P XY ) + α (cid:98) p and Γ p,q, (cid:98) p ( α | P XY ) := Υ p,q, (cid:98) p ( α | P XY ) + α (cid:98) p are symmetricw.r.t. ( p, (cid:98) p ) . For the product distribution P nXY , we deﬁne the forward and reverse n -BL exponents respectively as Γ ( n ) p,q, (cid:98) p ( α ) := n Γ p,q, (cid:98) p ( nα | P nXY ) and Γ ( n ) p,q, (cid:98) p ( α ) := n Γ p,q, (cid:98) p ( nα | P nXY ) , as well as, the forward and reverse asymptotic BL exponents respectively as Γ ( ∞ ) p,q, (cid:98) p ( α ) := lim n →∞ Γ ( n ) p,q, (cid:98) p ( α ) and Γ ( ∞ ) p,q, (cid:98) p ( α ) := lim n →∞ Γ ( n ) p,q, (cid:98) p ( α ) .Deﬁne Γ ∗ p,q, (cid:98) p ( α ) :=  inf Q W ,Q X | W ,Q Y | W θ p,q (cid:48) , (cid:98) p (cid:0) Q W , Q X | W , Q Y | W (cid:1) q > Q W ,Q X | W ,Q Y θ p,q (cid:48) , (cid:98) p (cid:0) Q W , Q X | W , Q Y (cid:1) < q < Q W sup Q Y inf Q X | W θ p,q (cid:48) , (cid:98) p (cid:0) Q W , Q X | W , Q Y (cid:1) q < (35)where q (cid:48) := qq − is the Hölder conjugate of q , and θ p,q (cid:48) , (cid:98) p (cid:0) Q W , Q X | W , Q Y | W (cid:1) := D (cid:0) Q X | W , Q Y | W (cid:107) P XY | Q W (cid:1) − D (cid:0) Q Y | W (cid:107) P Y | Q W (cid:1) q (cid:48) + η p, (cid:98) p (cid:0) α, D (cid:0) Q X | W (cid:107) P X | Q W (cid:1)(cid:1) . By Carathéodory’s theorem, without loss of optimality, it sufﬁces to restrict |W| ≤ .According to the signs of p, (cid:98) p , we partition the distributions (cid:110) Q X | W , (cid:98) Q X | W (cid:111) into two subsets S + , S − . Specif-ically, Q X | W ∈ S + if p > ; Q X | W ∈ S − otherwise. Similarly, (cid:98) Q X | W are assigned into S + , S − respectivelyaccording to the sign of (cid:98) p . 
Deﬁne Γ ∗ p,q, (cid:98) p ( α ) :=  sup Q W sup S + inf S − inf Q Y min (cid:26) θ p,q (cid:48) (cid:0) Q W , Q X | W , Q Y (cid:1) , θ (cid:98) p,q (cid:48) (cid:16) Q W , (cid:98) Q X | W , Q Y (cid:17)(cid:27) q > Q W sup S + inf S − inf Q Y | W min (cid:26) θ p,q (cid:48) (cid:0) Q W , Q X | W , Q Y | W (cid:1) , θ (cid:98) p,q (cid:48) (cid:16) Q W , (cid:98) Q X | W , Q Y | W (cid:17) < q < Q WY sup S + inf S − min (cid:26) θ p,q (cid:48) (cid:0) Q W , Q X | W , Q Y | W (cid:1) , θ (cid:98) p,q (cid:48) (cid:16) Q W , (cid:98) Q X | W , Q Y | W (cid:17)(cid:27) q < (36)where θ p,q (cid:48) (cid:0) Q W , Q X | W , Q Y | W (cid:1) := D (cid:0) Q X | W , Q Y | W (cid:107) P XY | Q W (cid:1) − D (cid:0) Q Y | W (cid:107) P Y | Q W (cid:1) q (cid:48) + α − D (cid:0) Q X | W (cid:107) P X | Q W (cid:1) p . and the inﬁmization inf S − is taken over all the distributions in S − , and the supremization sup S + is taken over all thedistributions in S + under the constraints α − (cid:98) p/p ≤ D (cid:0) Q X | W (cid:107) P X | Q W (cid:1) ≤ α + (cid:98) p/p and α − p/ (cid:98) p ≤ D (cid:16) (cid:98) Q X | W (cid:107) P X | Q W (cid:17) ≤ α + p/ (cid:98) p if p, (cid:98) p > . By Carathéodory’s theorem, without loss of optimality, it sufﬁces to restrict |W| ≤ .We obtain the following single-function version of strong BL inequalities. The proof is provided in Appendix C. Theorem 3 (Brascamp-Lieb Exponents (Single-Function Version)) . For any n ≥ , p, q, (cid:98) p ∈ R \ { } , and α ∈ (cid:40) [0 , α max ] p (cid:98) p > −∞ , p (cid:98) p < , we have Γ ( n ) p,q, (cid:98) p ( α ) ≥ Γ ∗ p,q, (cid:98) p ( α ) (37) Γ ( n ) p,q, (cid:98) p ( α ) ≤ Γ ∗ p,q, (cid:98) p ( α ) . (38) Moreover, for ﬁnite alphabets X , Y , these two inequalities are asymptotically sharp as n → ∞ for given ( p, q, (cid:98) p, α ) with α (cid:54) = 0 , α max . The theorem above is equivalent to the following single-function version of strong BL inequalities.

Corollary 4 (Strong Brascamp-Lieb Inequalities (Single-Function Version)) . For any n ≥ and p, q, (cid:98) p ∈ R \ { } , (cid:107) E [ f ( X ) | Y ] (cid:107) q ≤ e − n (cid:16) Γ ∗ p,q, (cid:98) p ( α ) − αp (cid:17) (cid:107) f ( X ) (cid:107) p (39) (cid:107) E [ f ( X ) | Y ] (cid:107) q ≥ e − n (cid:16) Γ ∗ p,q, (cid:98) p ( α ) − αp (cid:17) (cid:107) f ( X ) (cid:107) p , (40) for all nonnegative function f , where α = n H p, (cid:98) p ( f ) .Remark . By symmetry, the inequalities (39) and (40) still hold if p, (cid:98) p are swapped.Deﬁne Θ q (cid:48) ( α ) :=  inf Q W ,Q X | W ,Q Y | W : D ( Q X | W (cid:107) P X | Q W ) ≥ α D (cid:0) Q X | W , Q Y | W (cid:107) P XY | Q W (cid:1) − D ( Q Y | W (cid:107) P Y | Q W ) q (cid:48) q > Q W ,Q X | W ,Q Y : D ( Q X | W (cid:107) P X | Q W ) ≥ α D (cid:0) Q X | W , Q Y (cid:107) P XY | Q W (cid:1) − D ( Q Y (cid:107) P Y ) q (cid:48) < q < Q W sup Q Y inf Q X | W : D ( Q X | W (cid:107) P X | Q W ) ≥ α D (cid:0) Q X | W , Q Y (cid:107) P XY | Q W (cid:1) − D ( Q Y (cid:107) P Y ) q (cid:48) q < (41)and Θ q (cid:48) ( α ) :=  sup Q W ,Q X | W : D ( Q X | W (cid:107) P X | Q W ) ≤ α inf Q Y D (cid:0) Q X | W , Q Y (cid:107) P XY | Q W (cid:1) − D ( Q Y (cid:107) P Y ) q (cid:48) q > Q W ,Q X | W : D ( Q X | W (cid:107) P X | Q W ) ≤ α inf Q Y | W D (cid:0) Q X | W , Q Y | W (cid:107) P XY | Q W (cid:1) − D ( Q Y | W (cid:107) P Y | Q W ) q (cid:48) < q < Q W ,Q X | W ,Q Y | W : D ( Q X | W (cid:107) P X | Q W ) ≤ α D (cid:0) Q X | W , Q Y | W (cid:107) P XY | Q W (cid:1) − D ( Q Y | W (cid:107) P Y | Q W ) q (cid:48) q < . (42)It is easy to verify that Θ q (cid:48) ( α ) , Θ q (cid:48) ( α ) ≥ . 
Define various conditions, which are similar to those in the two-function case:

1) Condition 0+: $\underline{\Theta}_{q'}(\alpha) + \frac{s-\alpha}{p^*} \le \underline{\Theta}_{q'}(s)$ for all $s \ge 0$.
2) Condition 0-: $\overline{\Theta}_{q'}(\alpha) + \frac{s-\alpha}{p^*} \ge \overline{\Theta}_{q'}(s)$ for all $s \ge 0$.
3) Condition 1+: $H_{p,p^*}(f) \ge \alpha$ if $p \ge p^*$; $H_{p,p^*}(f) \le \alpha$ if $0 \le p \le p^*$.
4) Condition 1-: $H_{p,p^*}(f) \le \alpha$ if $p \ge p^*$; $H_{p,p^*}(f) \ge \alpha$ if $0 \le p \le p^*$.

In Conditions 0+ and 0-, if $\underline{\Theta}_{q'}(s)$ and $\overline{\Theta}_{q'}(s)$ are differentiable at the point $s = \alpha$, then $s \mapsto \underline{\Theta}_{q'}(\alpha) + \frac{s-\alpha}{p^*}$ and $s \mapsto \overline{\Theta}_{q'}(\alpha) + \frac{s-\alpha}{p^*}$ respectively correspond to the tangent lines of $\underline{\Theta}_{q'}(s)$ and $\overline{\Theta}_{q'}(s)$ at the point $s = \alpha$.

By Corollary 4, we obtain the following simpler version of strong (forward and reverse) BL inequalities.

Theorem 4 (Strong Brascamp-Lieb Inequalities (Single-Function Version)). Let $\alpha \in [0,\alpha_{\max}]$. Then the following hold.

1) Let $p^* > 0$, $q \ne 1$ satisfy Condition 0+. Let $p > 0$. Then
\[
\| \mathbb{E}[f(\mathbf{X}) | \mathbf{Y}] \|_q \le e^{-n\left(\underline{\Theta}_{q'}(\alpha) - \frac{\alpha}{p}\right)} \| f(\mathbf{X}) \|_p \tag{43}
\]
for all nonnegative $f$ satisfying Condition 1+.

2) Let $p^* > 0$, $q \ne 1$ satisfy Condition 0-. Let $p > 0$. Then
\[
\| \mathbb{E}[f(\mathbf{X}) | \mathbf{Y}] \|_q \ge e^{-n\left(\overline{\Theta}_{q'}(\alpha) - \frac{\alpha}{p}\right)} \| f(\mathbf{X}) \|_p \tag{44}
\]
for all nonnegative $f$ satisfying Condition 1-.

Remark. Note that $\underline{\Theta}_{q'}(\alpha) - \frac{\alpha}{p}$ and $\overline{\Theta}_{q'}(\alpha) - \frac{\alpha}{p}$ may not be always positive (or always negative).

For $\alpha \ge 0$, define
\[
\widetilde{\mathcal{R}}^{(\alpha)}_{\mathrm{FH},q'}(P_{XY}) := \left\{ p \in [1,\infty) : \underline{\Theta}_{q'}(s) \ge \frac{s}{p},\ \forall s \in [\alpha, \alpha_{\max}] \right\}
\]
\[
\widetilde{\mathcal{R}}^{(\alpha)}_{\mathrm{RH},q'}(P_{XY}) := \left\{ p \in (0,1] : \overline{\Theta}_{q'}(s) \le \frac{s}{p},\ \forall s \in [\alpha, \alpha_{\max}] \right\}.
\]
Note that here $\widetilde{\mathcal{R}}^{(\alpha)}_{\mathrm{RH},q'}(P_{XY})$ is consistent with the one defined in (30). By Theorem 4, it is easy to obtain the following strong version of hypercontractivity.

Corollary 5 (Strong Hypercontractivity Inequalities). Let $\alpha \in [0,\alpha_{\max}]$. Then the following hold.

1) Let $p^* > 0$, $q \ge 1$ satisfy Condition 0+. Let $p \in \widetilde{\mathcal{R}}^{(\alpha)}_{\mathrm{FH},q'}(P_{XY})$. Then
\[
\| \mathbb{E}[f(\mathbf{X}) | \mathbf{Y}] \|_q \le e^{-n\left(\underline{\Theta}_{q'}(\alpha) - \frac{\alpha}{p}\right)} \| f(\mathbf{X}) \|_p \tag{45}
\]
for all nonnegative $f$ satisfying Condition 1+.

2) Let $p^* > 0$, $q \le 1$ satisfy Condition 0-. Let $p \in \widetilde{\mathcal{R}}^{(\alpha)}_{\mathrm{RH},q'}(P_{XY})$. Then
\[
\| \mathbb{E}[f(\mathbf{X}) | \mathbf{Y}] \|_q \ge e^{-n\left(\overline{\Theta}_{q'}(\alpha) - \frac{\alpha}{p}\right)} \| f(\mathbf{X}) \|_p \tag{46}
\]
for all nonnegative $f$ satisfying Condition 1-.

Remark. For finite $\mathcal{X}, \mathcal{Y}$, given $\alpha > 0$, the inequality (45) is asymptotically sharp, in the sense that for any $p \in [1,\infty) \setminus \widetilde{\mathcal{R}}^{(\alpha)}_{\mathrm{FH},q'}(P_{XY})$, there exist a positive integer $n$ and a nonnegative function $f$ on $\mathcal{X}^n$ that satisfy the assumption in Theorem 1 but violate (45). Given $\alpha > 0$, the inequality (46) is asymptotically sharp in a similar sense.

IV. EXAMPLES: BINARY CASE

In this section, the bases of logarithms are set to $2$. Consider a doubly symmetric binary distribution $P_{XY}$ with correlation $\rho$, i.e.,
\[
P_{XY} = \begin{array}{c|cc} X \backslash Y & 0 & 1 \\ \hline 0 & \frac{1+\rho}{4} & \frac{1-\rho}{4} \\ 1 & \frac{1-\rho}{4} & \frac{1+\rho}{4} \end{array}.
\]
Define $k = \left( \frac{1+\rho}{1-\rho} \right)^2$. Define
\[
D(a) := D\left( a \,\Big\|\, \tfrac{1}{2} \right) = 1 - H(a),
\]
\[
D_{a,b}(p) := D\left( (p,\ a-p,\ b-p,\ 1+p-a-b) \,\Big\|\, \left( \tfrac{1+\rho}{4}, \tfrac{1-\rho}{4}, \tfrac{1-\rho}{4}, \tfrac{1+\rho}{4} \right) \right),
\]
\[
\mathbb{D}(a,b) := \min_{0,\, a+b-1 \le p \le a,\, b} D_{a,b}(p) = D_{a,b}\left( p^*_{a,b} \right),
\]
where $H : t \in [0,1] \mapsto -t \log t - (1-t) \log (1-t)$ is the binary entropy function, and
\[
p^*_{a,b} = \frac{(k-1)(a+b) + 1 - \sqrt{\left( (k-1)(a+b) + 1 \right)^2 - 4k(k-1)ab}}{2(k-1)}.
\]

Figure 1. Illustration of $\varphi(s,t)$, $\psi(s,t)$, $\psi_q(s)$, $\underline{\Theta}(s,t)$, $\overline{\Theta}(s,t)$, and $\overline{\Theta}_q(s)$ for ρ = 0. . Panels: $\varphi(s,t)$; $\underline{\Theta}(s,t)$; $\overline{\Theta}(s,t) = \psi(s,t)$; $\overline{\Theta}_q(s) = \psi_q(s)$. Note that $\underline{\Theta}(s,t)$, $\overline{\Theta}(s,t)$, and $\overline{\Theta}_q(s)$ are expressed in terms of $\varphi(s,t)$, $\psi(s,t)$, and $\psi_q(s)$ via (16)-(18).

For the doubly symmetric binary distribution $P_{XY}$,
\[
\varphi(s,t) = \mathbb{D}\left( H^{-1}(1-s),\ H^{-1}(1-t) \right)
\]
\[
\psi(s,t) = \mathbb{D}\left( H^{-1}(1-s),\ 1 - H^{-1}(1-t) \right)
\]
\[
\psi_q(s) = \min_{0 \le b \le 1} \mathbb{D}\left( H^{-1}(1-s),\ b \right) - \frac{D(b)}{q},
\]
where $H^{-1}$ is the inverse of the restriction of the binary entropy function $H$ to the set $\left[ 0, \frac{1}{2} \right]$. Then $\underline{\Theta}(\alpha,\beta)$, $\overline{\Theta}(\alpha,\beta)$, and $\overline{\Theta}_q(\alpha)$ are determined by $\varphi(s,t)$, $\psi(s,t)$, and $\psi_q(s)$ via (16)-(18). The functions $\varphi(s,t)$, $\psi(s,t)$, $\psi_q(s)$, $\underline{\Theta}(s,t)$, $\overline{\Theta}(s,t)$, and $\overline{\Theta}_q(s)$ for ρ = 0. . are plotted in Fig. 1.

V. APPLICATIONS

A. Rényi Concentration Functions

Deﬁne the ( p, q ) -Rényi concentration functions of (cid:0) P X , P Y | X (cid:1) as η p → q (cid:0) α | P X , P Y | X (cid:1) := sup Q X : D p ( Q X (cid:107) P X )= α D q ( Q Y (cid:107) P Y ) η p → q (cid:0) α | P X , P Y | X (cid:1) := inf Q X : D p ( Q X (cid:107) P X )= α D q ( Q Y (cid:107) P Y ) where P Y := P X ◦ P Y | X and Q Y := Q X ◦ P Y | X . Note that lim α ↓ α η p → q (cid:0) α | P X , P Y | X (cid:1) corresponds to the Rényihypercontractivity constants deﬁned by Raginsky [26]. He observed that the Rényi hypercontractivity constantsare equal to the optimal constants in the hypercontractivity inequalities for given ( p, q ) in the hypercontractivity region. As a special case, the (1 , -Rényi concentration function (which we term the data-processing function ) of (cid:0) P X , P Y | X (cid:1) is η (cid:0) α | P X , P Y | X (cid:1) := η → (cid:0) α | P X , P Y | X (cid:1) = sup Q X : D ( Q X (cid:107) P X )= α D ( Q Y (cid:107) P Y ) . Lemma 3 (Equivalence) . For p, q ∈ R \ { } and α ∈ (cid:40) [0 , α max ] p > −∞ , p < , we have η p → q (cid:0) α | P X , P Y | X (cid:1) = (cid:40) q (cid:48) (cid:0) α − Γ p,q, ( α | P XY ) (cid:1) q > or q < q (cid:48) (cid:0) α − Γ p,q, ( α | P XY ) (cid:1) < q < and η p → q (cid:0) α | P X , P Y | X (cid:1) = (cid:40) q (cid:48) (cid:0) α − Γ p,q, ( α | P XY ) (cid:1) q > or q < q (cid:48) (cid:0) α − Γ p,q, ( α | P XY ) (cid:1) < q < Deﬁne η ( n ) p → q (cid:0) α | P X , P Y | X (cid:1) := 1 n η p → q (cid:16) nα | P nX , P nY | X (cid:17) η ( n ) p → q (cid:0) α | P X , P Y | X (cid:1) := 1 n η p → q (cid:16) nα | P nX , P nY | X (cid:17) , and η ( ∞ ) p → q and η ( ∞ ) p → q as their limits as n → ∞ . Theorem 5 (Generalized Mr. and Mrs. Gerber’s Lemmas) . 
For p, q ∈ R \ { } and α ∈ (cid:40) [0 , α max ] p > −∞ , p < , wehave η ( n ) p → q (cid:0) α | P X , P Y | X (cid:1) ≤ η ∗ p → q (cid:0) α | P X , P Y | X (cid:1) := (cid:40) q (cid:48) (cid:0) α − Γ ∗ p,q, ( α ) (cid:1) q > or q < q (cid:48) (cid:16) α − Γ ∗ p,q, ( α ) (cid:17) < q < η ( n ) p → q (cid:0) α | P X , P Y | X (cid:1) ≥ η ∗ p → q (cid:0) α | P X , P Y | X (cid:1) := (cid:40) q (cid:48) (cid:16) α − Γ ∗ p,q, ( α ) (cid:17) q > or q < q (cid:48) (cid:0) α − Γ ∗ p,q, ( α ) (cid:1) < q < where Γ ∗ p,q, ( α ) , Γ ∗ p,q, ( α ) are respectively deﬁned in (35) and (36) . Moreover, for ﬁnite alphabets X , Y , these twoinequalities are asymptotically sharp as n → ∞ for given ( p, q, α ) with α (cid:54) = 0 , α max , i.e., η ( ∞ ) p → q (cid:0) α | P X , P Y | X (cid:1) = η ∗ p → q (cid:0) α | P X , P Y | X (cid:1) η ( ∞ ) p → q (cid:0) α | P X , P Y | X (cid:1) = η ∗ p → q (cid:0) α | P X , P Y | X (cid:1) . The Mr. and Mrs. Gerber’s lemmas in [21], [22] only focus on the KL divergence and the doubly symmetricbinary distribution. Hence, Theorem 5 is a generalization of Mrs. Gerber’s lemmas to the Rényi divergence andarbitrary distributions on Polish spaces.
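As a point of reference for Theorem 5, the classical Mrs. Gerber's lemma can be checked numerically on the doubly symmetric binary distribution of Section IV. The sketch below is an illustration only and is not taken from the paper. It assumes logarithms to base 2, a uniform $P_X$ on $\{0,1\}$, a binary symmetric channel $P_{Y|X}$ with crossover probability $\delta = (1-\rho)/2$, and the classical bound $\frac{1}{n} D(Q_{Y^n}\|P_Y^n) \le 1 - H\left(H^{-1}(1-\alpha) * \delta\right)$ whenever $\frac{1}{n} D(Q_{X^n}\|P_X^n) = \alpha$, where $a * \delta := a(1-\delta) + (1-a)\delta$. All function names are hypothetical.

```python
import math
from itertools import product


def h2(t):
    """Binary entropy in bits."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    return -t * math.log2(t) - (1 - t) * math.log2(1 - t)


def h2_inv(y, iters=80):
    """Inverse of h2 restricted to [0, 1/2], computed by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(iters):
        mid = (lo + hi) / 2
        if h2(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def gerber_bound(alpha, rho):
    """Classical Mrs. Gerber bound on D(Q_Y||P_Y)/n given D(Q_X||P_X)/n = alpha."""
    delta = (1 - rho) / 2
    a = h2_inv(1 - alpha)
    return 1 - h2(a * (1 - delta) + (1 - a) * delta)


def kl_to_uniform(q):
    """D(q || uniform) in bits for a pmf q on 2^n points."""
    n_bits = math.log2(len(q))
    return n_bits + sum(v * math.log2(v) for v in q if v > 0)


def push_through_bsc(q, n, rho):
    """Output pmf on {0,1}^n when the input pmf q passes through n uses of the BSC."""
    delta = (1 - rho) / 2
    points = list(product((0, 1), repeat=n))
    out = []
    for y in points:
        total = 0.0
        for qx, x in zip(q, points):
            flips = sum(xi != yi for xi, yi in zip(x, y))
            total += qx * delta**flips * (1 - delta) ** (n - flips)
        out.append(total)
    return out
```

For a pmf `q` on $\{0,1\}^n$ (listed in the order produced by `itertools.product`), comparing `kl_to_uniform(push_through_bsc(q, n, rho)) / n` with `gerber_bound(kl_to_uniform(q) / n, rho)` gives a quick check of the bound; for $n = 1$ the two sides coincide.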

B. Strong Small-Set Expansion Theorems

Define
\[
\underline{\Theta}^{(n)}(\alpha,\beta) := -\frac{1}{n} \log \sup_{A \in \mathbb{B}_{\mathcal{X}}^{\otimes n},\, B \in \mathbb{B}_{\mathcal{Y}}^{\otimes n}:\ P_X^n(A) \le e^{-n\alpha},\ P_Y^n(B) \le e^{-n\beta}} P_{XY}^n(A \times B), \tag{47}
\]
\[
\overline{\Theta}^{(n)}(\alpha,\beta) := -\frac{1}{n} \log \inf_{A \in \mathbb{B}_{\mathcal{X}}^{\otimes n},\, B \in \mathbb{B}_{\mathcal{Y}}^{\otimes n}:\ P_X^n(A) \ge e^{-n\alpha},\ P_Y^n(B) \ge e^{-n\beta}} P_{XY}^n(A \times B), \tag{48}
\]
and $\underline{\Theta}^{(\infty)}$ and $\overline{\Theta}^{(\infty)}$ as their limits as $n \to \infty$. Our strong hypercontractivity inequalities imply the following strong small-set expansion theorem.

Theorem 6 (Strong Small-Set Expansion Theorem). For any $n \ge 1$ and $\alpha \in [0,\alpha_{\max}]$, $\beta \in [0,\beta_{\max}]$,
\[
\underline{\Theta}^{(n)}(\alpha,\beta) \ge \underline{\Theta}(\alpha,\beta), \tag{49}
\]
\[
\overline{\Theta}^{(n)}(\alpha,\beta) \le \overline{\Theta}(\alpha,\beta), \tag{50}
\]
where $\underline{\Theta}(\alpha,\beta)$, $\overline{\Theta}(\alpha,\beta)$ are respectively defined in (12) and (13).

This theorem for finite alphabets $\mathcal{X}, \mathcal{Y}$ was first proven by the present author together with Anantharam and Chen [3]. Our theorem here is a generalization of the one in [3] to arbitrary distributions on any measurable spaces. It was shown that for finite alphabets $\mathcal{X}, \mathcal{Y}$, (49) and (50) are asymptotically sharp as $n \to \infty$ for given $\alpha, \beta > 0$, i.e., $\underline{\Theta}^{(\infty)}(\alpha,\beta) = \underline{\Theta}(\alpha,\beta)$, $\overline{\Theta}^{(\infty)}(\alpha,\beta) = \overline{\Theta}(\alpha,\beta)$. As discussed in [3], our strong small-set expansion theorem is sharper than O'Donnell's forward small-set expansion theorem [27] and Mossel et al.'s reverse small-set expansion theorem [28]. Moreover, for the limiting case of $n \to \infty$, our theorems reduce to theirs for a sequence of pairs $\left( \alpha^{(n)}, \beta^{(n)} \right)$ such that $\lim_{n \to \infty} \alpha^{(n)} = \lim_{n \to \infty} \beta^{(n)} = 0$, e.g., $e^{-n\alpha^{(n)}} = a$, $e^{-n\beta^{(n)}} = b$ for some fixed $0 < a, b < 1$. Similar to the observation in Section II-D, conceptually, our small-set expansion theorem here can be seen as the large deviation theorem of the small-set expansion problem, while the forward and reverse small-set expansion theorems in [27] and [28] correspond to the moderate deviation theorem of the small-set expansion problem.
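For very small block lengths, the quantity in (47) can be evaluated by exhaustive search, which gives a concrete feel for what Theorem 6 bounds. The following is a minimal sketch under stated assumptions, not the method of the paper: it takes the doubly symmetric binary distribution with correlation `rho`, uses logarithms to base 2 (so the constraint $P_X^n(A) \le 2^{-n\alpha}$ becomes $|A| \le 2^{n(1-\alpha)}$ because the marginals are uniform), and is only practical for $n \le 3$. The function names are hypothetical.

```python
import math
from itertools import combinations, product


def dsbs_block_prob(x, y, rho):
    """P_XY^n(x, y) for n i.i.d. doubly symmetric binary pairs with correlation rho."""
    prob = 1.0
    for xi, yi in zip(x, y):
        prob *= (1 + rho) / 4 if xi == yi else (1 - rho) / 4
    return prob


def forward_exponent_bruteforce(n, alpha, beta, rho):
    """Evaluate (47) in bits by exhaustive search over A, B subsets of {0,1}^n.

    P_XY^n(A x B) is monotone in A and in B, so it is enough to search
    over sets of the largest admissible cardinality.
    """
    points = list(product((0, 1), repeat=n))
    size_a = math.floor(2 ** (n * (1 - alpha)) + 1e-9)
    size_b = math.floor(2 ** (n * (1 - beta)) + 1e-9)
    if size_a < 1 or size_b < 1:
        return math.inf  # only the empty set is admissible
    joint = {(x, y): dsbs_block_prob(x, y, rho) for x in points for y in points}
    best = 0.0
    for set_a in combinations(points, size_a):
        for set_b in combinations(points, size_b):
            mass = sum(joint[x, y] for x in set_a for y in set_b)
            best = max(best, mass)
    return -math.log2(best) / n
```

When `rho = 0` the pair is independent, so the search returns $\alpha + \beta$ whenever $2^{n(1-\alpha)}$ and $2^{n(1-\beta)}$ are integers; positive correlation lowers the exponent.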
The moderate deviation theorem can be recovered from the large deviation theorem by letting $\alpha, \beta$ go to zero at a certain speed, if $P_{XY}$ satisfies that $\underline{\Theta}(s,t) \ge \frac{s}{p} + \frac{t}{q}$ and $\overline{\Theta}(s,t) \le \frac{s}{p} + \frac{t}{q}$ hold for all $s, t \ge 0$ if and only if they hold for any neighborhood of the origin, e.g., doubly symmetric binary distributions. For this case, the small-set expansion theorems in [27] and [28] are sharp in the moderate deviation regime. Furthermore, strengthening the small-set expansion theorem was also studied by Ordentlich, Polyanskiy, and Shayevitz [29], but they only solved the limiting cases $\rho \to 0, 1$. The symmetric case $\alpha = \beta$ was solved by Kirshner and Samorodnitsky [2]. Our theorem is a generalization of theirs.

C. q-Stability

In this subsection, we study the $q$-stability of Boolean functions. Let $(\mathbf{X}, \mathbf{Y}) \sim P_{XY}^n$. For $\alpha \in [0,\alpha_{\max}]$, we aim at characterizing
\[
\underline{\Theta}_q^{(n)}(\alpha) := -\sup_{A \in \mathbb{B}_{\mathcal{X}}^{\otimes n}:\ P_X^n(A) \le e^{-n\alpha}} \frac{1}{nq} \log \mathbb{E}_{\mathbf{Y} \sim P_Y^n}\left[ P_{X|Y}^n(A | \mathbf{Y})^q \right] \tag{51}
\]
\[
\overline{\Theta}_q^{(n)}(\alpha) := -\inf_{A \in \mathbb{B}_{\mathcal{X}}^{\otimes n}:\ P_X^n(A) \ge e^{-n\alpha}} \frac{1}{nq} \log \mathbb{E}_{\mathbf{Y} \sim P_Y^n}\left[ P_{X|Y}^n(A | \mathbf{Y})^q \right], \tag{52}
\]
and $\underline{\Theta}_q^{(\infty)}$ and $\overline{\Theta}_q^{(\infty)}$ as their limits as $n \to \infty$. Our strong BL inequalities in Theorem 4 imply the following strong $q$-stability theorem.

Theorem 7 (Strong $q$-Stability Theorem). For any $n \ge 1$ and $\alpha \in [0,\alpha_{\max}]$,
\[
\underline{\Theta}_q^{(n)}(\alpha) \ge \underline{\Theta}_{q'}(\alpha), \qquad \overline{\Theta}_q^{(n)}(\alpha) \le \overline{\Theta}_{q'}(\alpha),
\]
where $\underline{\Theta}_{q'}(\alpha)$, $\overline{\Theta}_{q'}(\alpha)$ are respectively defined in (41) and (42). Moreover, for finite alphabets $\mathcal{X}, \mathcal{Y}$, these two inequalities are asymptotically sharp as $n \to \infty$ for given $\alpha > 0$, i.e.,
\[
\underline{\Theta}_q^{(\infty)}(\alpha) = \underline{\Theta}_{q'}(\alpha), \qquad \overline{\Theta}_q^{(\infty)}(\alpha) = \overline{\Theta}_{q'}(\alpha).
\]

Our results here strengthen and generalize the (single-function version of the) forward small-set expansion theorem due to Kahn, Kalai, and Linial [30] and the (single-function version of the) reverse small-set expansion theorem due to Mossel et al. [28].

APPENDIX A
PROOF OF PROPOSITION

Lemma 4.

Let P i , i ∈ [ n ] be n probability measures on X and µ be a nonnegative measure on X such that all P i (cid:28) µ, i ∈ [ n ] . Let s i , i ∈ [ n ] be n real numbers such that (cid:80) ni =1 s i = 1 . Let c j : X → [ −∞ , ∞ ] , j ∈ [ m ] be m measurable functions. Denote β := (cid:82) e − (cid:80) mj =1 c j (cid:81) ni =1 (cid:16) d P i d µ (cid:17) s i d µ . Then we have − log β = inf Q n (cid:88) i =1 s i D ( Q (cid:107) P i ) + m (cid:88) j =1 (cid:90) c j d Q, (53) where the conventions that · ∞ = 0 , and s = ∞ for s < , and for s = 0 are adopted, according to Convention1, the inﬁmization in the RHS of (53) is over all probability measures Q on X such that D ( Q (cid:107) P i ) < ∞ , ∀ i ∈ [ n ] and (cid:82) | c j | d Q < ∞ , ∀ j ∈ [ m ] . Moreover, if < β < ∞ , D ( Q ∗ (cid:107) P i ) < ∞ , ∀ i ∈ [ n ] , and (cid:82) | c j | d Q ∗ < ∞ , ∀ j ∈ [ m ] ,where Q ∗ is a probability measure with the density d Q ∗ d µ = e − (cid:80) mj =1 c j (cid:81) ni =1 (cid:16) d P i d µ (cid:17) s i β , (54) then the inﬁmization in the RHS of (53) is uniquely attained by Q ∗ .Proof: We ﬁrst consider the case < β < ∞ . For < a < b < ∞ , deﬁne A a,b := (cid:26) x : d P i d µ ( x ) , c j ( x ) ∈ ( a, b ) , ∀ i ∈ [ n ] , j ∈ [ m ] (cid:27) . Observe that for any Q with support contained in A a,b , we have D ( Q (cid:107) P i ) < ∞ , ∀ i ∈ [ n ] , (cid:82) | c j | d Q < ∞ , ∀ j ∈ [ m ] ,and moreover, n (cid:88) i =1 s i D ( Q (cid:107) P i ) + m (cid:88) j =1 (cid:90) c j d Q + log β a,b = D (cid:0) Q (cid:107) Q ∗ a,b (cid:1) ≥ , where β a,b := (cid:82) A a,b (cid:81) ni =1 (cid:16) d P i d µ (cid:17) s i d µ , and Q ∗ a,b is a probability measure with the density d Q ∗ a,b d µ := 1 A a,b e − (cid:80) mj =1 c j (cid:81) ni =1 (cid:16) d P i d µ (cid:17) s i β a,b . Moreover, the equality above holds if Q = Q ∗ a,b . Hence, − log β a,b = inf Q :supp( Q ) ⊆A a,b n (cid:88) i =1 s i D ( Q (cid:107) P i ) + m (cid:88) j =1 (cid:90) c j d Q. 
Taking infimization over $0 < a < b < \infty$ for both sides above and applying the monotone convergence theorem, we have $-\log \beta = \inf$ and $D(Q \| P_i) = \infty$, or $\int |c_j| \,\mathrm{d}Q^* = \infty$. Hence, the infimization at the RHS of (53) is taken over the empty set. Hence, both sides of (53) are $\infty$. This completes the proof of Lemma 4.

We may assume, by homogeneity, that $\|f\|_p = \|g\|_q = 1$. WLOG, we may also assume $\mathrm{supp}(f) \subseteq \mathrm{supp}(P_X)$ and $\mathrm{supp}(g) \subseteq \mathrm{supp}(P_Y)$. Hence, we can write
\[
f^p = \frac{\mathrm{d}Q_X}{\mathrm{d}P_X}, \qquad g^q = \frac{\mathrm{d}Q_Y}{\mathrm{d}P_Y},
\]
for some probability measures $Q_X \ll P_X$, $Q_Y \ll P_Y$. Moreover, we require $f < \infty$. Hence, $Q_X \ll\gg P_X$ if $p < 0$. Similarly, $Q_Y \ll\gg P_Y$ if $q < 0$.

By this choice of $f, g$, we have
\[
H_{p,\hat{p}}(f) = \frac{p\hat{p}}{p - \hat{p}} \log \frac{\|f(X)\|_p}{\|f(X)\|_{\hat{p}}} = -\frac{p}{p - \hat{p}} \log \int \left( \frac{\mathrm{d}Q_X}{\mathrm{d}P_X} \right)^{\hat{p}/p} \mathrm{d}P_X = D_{\hat{p}/p}(Q_X \| P_X). \tag{57}
\]
To ensure that $D_{\hat{p}/p}(Q_X \| P_X)$ exists, for the case $p > 0 > \hat{p}$, we still require $Q_X \ll\gg P_X$. Hence, $Q_X \ll P_X$ if $p, \hat{p} > 0$; otherwise, $Q_X \ll\gg P_X$.
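The identity (57) is easy to sanity-check on a finite alphabet. The sketch below is an illustration under assumptions, not part of the proof: it takes strictly positive probability vectors `p_x` and `q_x` (so that negative exponents are well defined), uses natural logarithms, and builds $f = (\mathrm{d}Q_X / \mathrm{d}P_X)^{1/p}$ so that $\|f\|_p = 1$. The function names are hypothetical.

```python
import math


def renyi_divergence(q_x, p_x, lam):
    """D_lam(Q||P) = log(sum_x Q(x)^lam P(x)^(1-lam)) / (lam - 1), for lam != 1."""
    total = sum(qi**lam * pi ** (1 - lam) for qi, pi in zip(q_x, p_x))
    return math.log(total) / (lam - 1)


def lp_norm(f, p_x, r):
    """||f(X)||_r = (E[f(X)^r])^(1/r) for X ~ p_x; r may be negative when f > 0."""
    return sum(pi * fi**r for pi, fi in zip(p_x, f)) ** (1 / r)


def entropy_functional(f, p_x, p, p_hat):
    """H_{p,p_hat}(f) = (p * p_hat / (p - p_hat)) * log(||f||_p / ||f||_p_hat)."""
    ratio = lp_norm(f, p_x, p) / lp_norm(f, p_x, p_hat)
    return p * p_hat / (p - p_hat) * math.log(ratio)


def f_from_density(q_x, p_x, p):
    """f with f^p = dQ_X/dP_X, as in the normalization preceding (57)."""
    return [(qi / pi) ** (1 / p) for qi, pi in zip(q_x, p_x)]
```

With `f = f_from_density(q_x, p_x, p)`, the value `entropy_functional(f, p_x, p, p_hat)` should agree with `renyi_divergence(q_x, p_x, p_hat / p)` up to floating-point error, including for negative `p` or `p_hat`.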
Similarly, $Q_Y \ll P_Y$ if $q, \hat{q} > 0$; otherwise, $Q_Y \ll\gg P_Y$.

In addition, we also have that
\[
-\log \mathbb{E}[f(X) g(Y)] = -\log \int \left( \frac{\mathrm{d}Q_X}{\mathrm{d}P_X} \right)^{1/p} \left( \frac{\mathrm{d}Q_Y}{\mathrm{d}P_Y} \right)^{1/q} \mathrm{d}P_{XY}
\]
\[
= -\log \int \left( \frac{\mathrm{d}\left( Q_X P_{Y|X} \right)}{\mathrm{d}P_{XY}} \right)^{1/p} \left( \frac{\mathrm{d}\left( Q_Y P_{X|Y} \right)}{\mathrm{d}P_{XY}} \right)^{1/q} \mathrm{d}P_{XY}
\]
\[
= \inf_{R_{XY}} \left\{ D(R_{XY} \| P_{XY}) + \frac{1}{p} D(R_X \| Q_X) - \frac{1}{p} D(R_X \| P_X) + \frac{1}{q} D(R_Y \| Q_Y) - \frac{1}{q} D(R_Y \| P_Y) \right\} \tag{58}
\]
\[
= \phi(Q_X, Q_Y | P_{XY}), \tag{59}
\]
where (58) follows by Lemma 4, and moreover, in (58), the infimization is taken over all $R_{XY}$ such that all the relative entropies appearing in the objective function are finite.

Substituting (57) and (59) into the definitions of $\underline{\Lambda}_{p,q,\hat{p},\hat{q}}(\alpha,\beta)$ and $\overline{\Lambda}_{p,q,\hat{p},\hat{q}}(\alpha,\beta)$ yields Theorem 1.

APPENDIX B
PROOF OF THEOREM

$p \ne \hat{p}$ and $q \ne \hat{q}$. The case of $p = \hat{p}$ or $q = \hat{q}$ follows similarly.

A. Forward Case

Here we only prove the case $p, q, \hat{p}, \hat{q} > 0$. Other cases can be proven similarly.

We first consider the case $n = 1$. Define
\[
\psi(R_{XY}) = D(R_{XY} \| P_{XY}) - \frac{1}{p} D(R_X \| P_X) - \frac{1}{q} D(R_Y \| P_Y) + \inf_{Q_X:\, D_{\hat{p}/p}(Q_X \| P_X) = \alpha} \frac{1}{p} D(R_X \| Q_X) + \inf_{Q_Y:\, D_{\hat{q}/q}(Q_Y \| P_Y) = \beta} \frac{1}{q} D(R_Y \| Q_Y).
\]
Then
\[
\underline{\Lambda}_{p,q,\hat{p},\hat{q}}(\alpha,\beta) = \inf_{R_{XY}} \psi(R_{XY}) + \frac{\alpha}{p} + \frac{\beta}{q}. \tag{60}
\]
Denote $\lambda = \hat{p}/p$, $\mu = \hat{q}/q$. By Lemma 4,
\[
D_\lambda(Q_X \| P_X) = \frac{1}{1-\lambda} \inf_{R_X} \left\{ \lambda D(R_X \| Q_X) + (1-\lambda) D(R_X \| P_X) \right\}. \tag{61}
\]
Hence, for any $R_X$,
\[
\inf_{Q_X:\, D_\lambda(Q_X \| P_X) = \alpha} \frac{1}{p} D(R_X \| Q_X) \ge \frac{1-\lambda}{\hat{p}} \left( \alpha - D(R_X \| P_X) \right). \tag{62}
\]
On the other hand, by the nonnegativity of relative entropies,
\[
\inf_{Q_X:\, D_\lambda(Q_X \| P_X) = \alpha} \frac{1}{p} D(R_X \| Q_X) \ge 0.
\]
Hence
\[
\inf_{Q_X:\, D_\lambda(Q_X \| P_X) = \alpha} \frac{1}{p} D(R_X \| Q_X) \ge \left[ \frac{1-\lambda}{\hat{p}} \left( \alpha - D(R_X \| P_X) \right) \right]^+, \tag{63}
\]
where $[x]^+ := x \vee 0$. Hence,
\[
\underline{\Lambda}_{p,q,\hat{p},\hat{q}}(\alpha,\beta) \ge \inf_{R_{XY}} D(R_{XY} \| P_{XY}) - \frac{1}{p} D(R_X \| P_X) - \frac{1}{q} D(R_Y \| P_Y) + \left[ \frac{1-\lambda}{\hat{p}} \left( \alpha - D(R_X \| P_X) \right) \right]^+ + \left[ \frac{1-\mu}{\hat{q}} \left( \beta - D(R_Y \| P_Y) \right) \right]^+ + \frac{\alpha}{p} + \frac{\beta}{q}
\]
\[
= \inf_{R_{XY}} D(R_{XY} \| P_{XY}) + \eta_{p,\hat{p}}(\alpha, D(R_X \| P_X)) + \eta_{q,\hat{q}}(\beta, D(R_Y \| P_Y)). \tag{64}
\]
Substituting $(\alpha, \beta, P_{XY}) \leftarrow (n\alpha, n\beta, P_{XY}^n)$, we obtain the $n$-dimensional version:
\[
\underline{\Lambda}^{(n)}_{p,q,\hat{p},\hat{q}}(\alpha,\beta) \ge \frac{1}{n} \inf_{R_{X^n Y^n}} D(R_{X^n Y^n} \| P_{XY}^n) + \eta_{p,\hat{p}}(n\alpha, D(R_{X^n} \| P_X^n)) + \eta_{q,\hat{q}}(n\beta, D(R_{Y^n} \| P_Y^n)). \tag{65}
\]
By the chain rule, we have
\[
D(R_{X^n Y^n} \| P_{XY}^n) = \sum_{i=1}^n D\left( R_{X_i Y_i | X^{i-1} Y^{i-1}} \| P_{XY} | R_{X^{i-1} Y^{i-1}} \right) \tag{66}
\]
\[
D(R_{X^n} \| P_X^n) = \sum_{i=1}^n D\left( R_{X_i | X^{i-1}} \| P_X | R_{X^{i-1}} \right)
\]
\[
D(R_{Y^n} \| P_Y^n) = \sum_{i=1}^n D\left( R_{Y_i | Y^{i-1}} \| P_Y | R_{Y^{i-1}} \right).
\]
Consider the joint distribution $R_{X^n Y^n K} := R_{X^n Y^n} \otimes \mathrm{Unif}[n]$, i.e., under this distribution, $K \sim \mathrm{Unif}[n]$ (called the random index) is independent of $(X^n, Y^n)$. Denote $X := X_K$, $Y := Y_K$, $U = \left( X^{K-1}, K \right)$, $V = \left( Y^{K-1}, K \right)$, $W = (U, V)$. Then
\[
D(R_{X^n Y^n} \| P_{XY}^n) = n D\left( R_{XY|W} \| P_{XY} | R_W \right)
\]
\[
D(R_{X^n} \| P_X^n) = n D\left( R_{X|U} \| P_X | R_U \right) \le n D\left( R_{X|W} \| P_X | R_W \right)
\]
\[
D(R_{Y^n} \| P_Y^n) = n D\left( R_{Y|V} \| P_Y | R_V \right) \le n D\left( R_{Y|W} \| P_Y | R_W \right).
\]
Substituting these into (65) yields that
\[
\underline{\Lambda}^{(n)}_{p,q,\hat{p},\hat{q}}(\alpha,\beta) \ge \inf_{R_{XYW}} D\left( R_{XY|W} \| P_{XY} | R_W \right) + \eta_{p,\hat{p}}\left( \alpha, D\left( R_{X|W} \| P_X | R_W \right) \right) + \eta_{q,\hat{q}}\left( \beta, D\left( R_{Y|W} \| P_Y | R_W \right) \right) = \underline{\Lambda}^*_{p,q,\hat{p},\hat{q}}(\alpha,\beta). \tag{67}
\]

B. Reverse Case

We only prove the case (cid:98) p, (cid:98) q > . Other cases can be proven similarly.Deﬁne Λ ∗∗ p,q, (cid:98) p, (cid:98) q ( α, β ) as a variant of Λ ∗ p,q, (cid:98) p, (cid:98) q ( α, β ) , by removing sup Q W from the deﬁnition of Λ ∗ p,q, (cid:98) p, (cid:98) q ( α, β ) ,and moreover, taking Q W to be a Dirac measure. In the following, we ﬁrst prove Λ p,q, (cid:98) p, (cid:98) q ( α, β ) ≤ Λ ∗∗ p,q, (cid:98) p, (cid:98) q ( α, β ) , (68)and then prove the n -dimensional version Λ ∗∗ p,q, (cid:98) p, (cid:98) q ( nα, nβ | P nXY ) ≤ n Λ ∗ p,q, (cid:98) p, (cid:98) q ( α, β ) . (69)These yield the desired result Λ ( n ) p,q, (cid:98) p, (cid:98) q ( α, β ) ≤ Λ ∗ p,q, (cid:98) p, (cid:98) q ( α, β ) . (70)Denote λ = (cid:98) p/p, µ = (cid:98) q/q . Deﬁne ψ ( Q X , Q Y , R X , R Y ) = D ( R X , R Y (cid:107) P XY ) − p D ( R X (cid:107) P X ) − q D ( R Y (cid:107) P Y )+ 1 p D ( R X (cid:107) Q X ) + 1 q D ( R Y (cid:107) Q Y ) + αp + βq . Then φ ( Q X , Q Y | P XY ) = inf R X ,R Y ψ ( Q X , Q Y , R X , R Y ) . We next prove (68). We ﬁrst consider the case p, q > . Since Λ p,q, (cid:98) p, (cid:98) q ( α, β ) is symmetric w.r.t. p, (cid:98) p and also q, (cid:98) q ,it sufﬁces to only consider the case λ, µ > . We ﬁrst consider the case α, β > . Deﬁne R ∗ X as the distributionwith density w.r.t. P X being d R ∗ X d P X = (cid:16) d Q X d P X (cid:17) λ (cid:82) (cid:16) d Q X d P X (cid:17) λ d P X . (71)The existence of R ∗ X follows by the requirement D λ ( Q X (cid:107) P X ) = α < ∞ . Deﬁne R ∗ Y similarly. We ﬁrst assumethat D ( R ∗ X (cid:107) P X ) , D ( R ∗ Y (cid:107) Q Y ) < ∞ . Later we will consider the case D ( R ∗ X (cid:107) P X ) = ∞ or D ( R ∗ Y (cid:107) Q Y ) = ∞ . 
Itis easy to check that for D ( R ∗ X (cid:107) P X ) < ∞ , we have D ( R ∗ X (cid:107) P X ) + λ − λ D ( R ∗ X (cid:107) Q X ) = D λ ( Q X (cid:107) P X ) , (72)which, combined with the monotonicity of the Rényi divergence in its order, implies that for λ > , D ( Q X (cid:107) P X ) ≤ D λ ( Q X (cid:107) P X ) ≤ D ( R ∗ X (cid:107) P X ) . (73)Denote s = D ( Q X (cid:107) P X ) , (cid:98) s = D ( R ∗ X (cid:107) P X ) and t = D ( Q Y (cid:107) P Y ) , (cid:98) t = D ( R ∗ Y (cid:107) P Y ) . Then under the constraints D λ ( Q X (cid:107) P X ) = α, D µ ( Q Y (cid:107) P Y ) = β , we have ≤ s ≤ α, ≤ t ≤ β, α ≤ (cid:98) s ≤ α max , β ≤ (cid:98) t ≤ β max . (74)Moreover, we also have φ ( Q X , Q Y | P XY ) + αp + βq ≤ min R X ∈{ R ∗ X ,Q X } ,R Y ∈{ R ∗ Y ,Q Y } ψ ( Q X , Q Y , R X , R Y ) (75) ≤ (cid:98) φ ( Q X , Q Y , R ∗ X , R ∗ Y ) , (76)where (cid:98) φ ( Q X , Q Y , R X , R Y ) := min (cid:26) D ( R X , R Y (cid:107) P XY ) + α − (cid:98) s (cid:98) p + β − (cid:98) t (cid:98) q , D ( R X , Q Y (cid:107) P XY ) + α − (cid:98) s (cid:98) p + β − tq , D ( Q X , R Y (cid:107) P XY ) + α − sp + β − (cid:98) t (cid:98) q , D ( Q X , Q Y (cid:107) P XY ) + α − sp + β − tq (cid:27) , and in (76), (72) was used to eliminate the terms D ( R ∗ X (cid:107) Q X ) , D ( R ∗ Y (cid:107) Q Y ) . Then, relaxing Q X , Q Y , R ∗ X , R ∗ Y tobe arbitrary distributions satisfying (103), we obtain Λ p,q, (cid:98) p, (cid:98) q ( α, β ) ≤ sup ≤ s ≤ α, ≤ t ≤ β,α ≤ (cid:98) s ≤ α max ,β ≤ (cid:98) t ≤ β max sup Q X ,Q Y ,R X ,R Y : D ( Q X (cid:107) P X )= s,D ( Q Y (cid:107) P Y )= tD ( R X (cid:107) P X )= (cid:98) s,D ( R Y (cid:107) P Y )= (cid:98) t (cid:98) φ ( Q X , Q Y , R X , R Y ) = Λ ∗∗ p,q, (cid:98) p, (cid:98) q ( α, β ) . (77)We now consider the case D ( R ∗ X (cid:107) P X ) = ∞ or D ( R ∗ Y (cid:107) Q Y ) = ∞ . Here we only provide a proof for thecase of D ( R ∗ X (cid:107) P X ) = D ( R ∗ Y (cid:107) Q Y ) = ∞ . 
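The tilted distribution $R_X^*$ of (71), the identity (72), and the ordering (73) can be verified numerically on a finite alphabet. The sketch below is an illustration, not part of the proof. It assumes strictly positive probability vectors and natural logarithms, and it also checks the variational characterization (61) in the form $\lambda D(R\|Q) + (1-\lambda) D(R\|P) \ge (1-\lambda) D_\lambda(Q\|P)$ with equality at $R = R^*$. All function names are hypothetical.

```python
import math


def kl(r, p):
    """Relative entropy D(R||P) in nats for pmfs on a common finite alphabet."""
    return sum(ri * math.log(ri / pi) for ri, pi in zip(r, p) if ri > 0)


def renyi(q, p, lam):
    """Renyi divergence D_lam(Q||P) in nats, for lam != 1."""
    return math.log(sum(qi**lam * pi ** (1 - lam) for qi, pi in zip(q, p))) / (lam - 1)


def tilted(q, p, lam):
    """R* of (71): density (dQ/dP)^lam w.r.t. P, normalized to a pmf."""
    weights = [qi**lam * pi ** (1 - lam) for qi, pi in zip(q, p)]
    z = sum(weights)
    return [w / z for w in weights]


def variational_objective(r, q, p, lam):
    """lam * D(R||Q) + (1 - lam) * D(R||P), the bracket in (61)."""
    return lam * kl(r, q) + (1 - lam) * kl(r, p)
```

For example, with `r_star = tilted(q, p, lam)`, the quantity `kl(r_star, p) + lam / (1 - lam) * kl(r_star, q)` should match `renyi(q, p, lam)`, which is (72); for `lam > 1` one can also confirm `kl(q, p) <= renyi(q, p, lam) <= kl(r_star, p)`, which is (73).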
The case of D ( R ∗ X (cid:107) P X ) < D ( R ∗ Y (cid:107) Q Y ) = ∞ and the case of D ( R ∗ Y (cid:107) Q Y ) < D ( R ∗ X (cid:107) P X ) = ∞ can be proven similarly. For the case of D ( R ∗ X (cid:107) P X ) = ∞ , we replace R ∗ X above as the distribution R ( r ) X with density w.r.t. P X being d R ( r ) X d P X = (cid:16) d Q X d P X (cid:17) λ A r (cid:82) (cid:16) d Q X d P X (cid:17) λ A r d P X (78)where A r := (cid:110) x : d Q X d P X ( x ) < r (cid:111) for r > . Then it is easy to verify that D (cid:16) R ( r ) X (cid:107) P X (cid:17) < ∞ and D (cid:16) R ( r ) X (cid:107) P X (cid:17) →∞ as r → ∞ . Moreover, analogue to (72), we have (1 − λ ) D (cid:16) R ( r ) X (cid:107) P X (cid:17) + λD (cid:16) R ( r ) X (cid:107) Q X (cid:17) = (1 − λ ) D λ ( Q X (cid:107) P X ) + ε r , (79)where ε r ≥ vanishes as r → ∞ . Hence, for sufﬁciently large r , D (cid:16) R ( r ) X (cid:107) P X (cid:17) ≥ α . By redeﬁning (cid:98) s = D (cid:16) R ( r ) X (cid:107) P X (cid:17) , we have that (74) still holds. Using (79) to replace (72), we still have (76), but with (cid:98) φ replaced bythe following (cid:98) φ r . (cid:98) φ r ( Q X , Q Y , R X , R Y ) := min (cid:26) D ( R X , R Y (cid:107) P XY ) + α − (cid:98) s + ε r (cid:98) p + β − (cid:98) t + ε r (cid:98) q , D ( R X , Q Y (cid:107) P XY ) + α − (cid:98) s + ε r (cid:98) p + β − tq , D ( Q X , R Y (cid:107) P XY ) + α − sp + β − (cid:98) t + ε r (cid:98) q , D ( Q X , Q Y (cid:107) P XY ) + α − sp + β − tq (cid:27) . Then, letting r → ∞ , we have (77). This completes the proof of (68) for the case p, q > .We next consider the case p, q < . For this case, we adopt a method similar to the above. We still deﬁne R ∗ X , R ∗ Y as above. We ﬁrst assume that D ( R ∗ X (cid:107) P X ) , D ( R ∗ Y (cid:107) Q Y ) < ∞ . 
For this case, similarly to the above,
\begin{align}
\phi(Q_X,Q_Y|P_{XY})+\frac{\alpha}{p}+\frac{\beta}{q}
&\le \min\Big\{\psi(Q_X,Q_Y,R_X^*,R_Y^*),\ \inf_{R_Y}\psi(Q_X,Q_Y,R_X^*,R_Y),\ \inf_{R_X}\psi(Q_X,Q_Y,R_X,R_Y^*),\ \inf_{R_X,R_Y}\psi(Q_X,Q_Y,R_X,R_Y)\Big\} \notag\\
&= \hat{\phi}(R_X^*,R_Y^*), \tag{80}
\end{align}
where
\begin{align*}
\hat{\phi}(S_X,S_Y):=\inf_{R_X,R_Y}\min\Big\{
& D(S_X,S_Y\|P_{XY})+\frac{\alpha-\hat{s}}{\hat{p}}+\frac{\beta-\hat{t}}{\hat{q}},\\
& D(S_X,R_Y\|P_{XY})+\frac{\alpha-\hat{s}}{\hat{p}}+\frac{\beta-D(R_Y\|P_Y)}{q},\\
& D(R_X,S_Y\|P_{XY})+\frac{\alpha-D(R_X\|P_X)}{p}+\frac{\beta-\hat{t}}{\hat{q}},\\
& D(R_X,R_Y\|P_{XY})+\frac{\alpha-D(R_X\|P_X)}{p}+\frac{\beta-D(R_Y\|P_Y)}{q}\Big\}.
\end{align*}
Then,
\[
\Lambda_{p,q,\hat{p},\hat{q}}(\alpha,\beta)\le\sup_{S_X,S_Y}\hat{\phi}(S_X,S_Y)=\Lambda^{**}_{p,q,\hat{p},\hat{q}}(\alpha,\beta), \tag{81}
\]
i.e., (68). The cases of $D(R_X^*\|P_X)=\infty$ or $D(R_Y^*\|Q_Y)=\infty$ can be proven similarly to the above. Hence, we omit the proofs for these cases.

The inequality (68) for the case $p>0>q$ can be proven similarly. However, note that for this case, (75) should be replaced by
\[
\phi(Q_X,Q_Y|P_{XY})+\frac{\alpha}{p}+\frac{\beta}{q}\le\min_{R_X\in\{R_X^*,Q_X\}}\min\Big\{\psi(Q_X,Q_Y,R_X,R_Y^*),\ \inf_{R_Y}\psi(Q_X,Q_Y,R_X,R_Y)\Big\}.
\]
We omit the proof for this case. By symmetry, the case $p < 0 < q$ also holds.

Based on the inequality (68), in order to prove the desired inequality (70), it suffices to prove (69), i.e., the following lemma.

Lemma 5. $\Lambda^{**}_{p,q,\hat{p},\hat{q}}(n\alpha,n\beta|P_{XY}^n)\le n\Lambda^{*}_{p,q,\hat{p},\hat{q}}(\alpha,\beta|P_{XY})$.
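The tilting identity (72) and the ordering (73), on which the proof of (68) above rests, can be sanity-checked numerically on a finite alphabet. The following is a minimal sketch, not part of the original proof: it assumes that the density in (71) is proportional to $(\mathrm{d}Q_X/\mathrm{d}P_X)^{\lambda}$ (as the truncated version (78) suggests), and the helper names `kl`, `renyi`, `tilted`, `random_pmf` as well as the alphabet size and the value of `lam` are illustrative choices.

```python
import math
import random

def kl(a, b):
    """Relative entropy D(a||b) in nats for strictly positive pmfs."""
    return sum(x * math.log(x / y) for x, y in zip(a, b))

def renyi(q, p, lam):
    """Renyi divergence D_lam(q||p) = log(sum q^lam p^(1-lam)) / (lam - 1)."""
    s = sum(x ** lam * y ** (1.0 - lam) for x, y in zip(q, p))
    return math.log(s) / (lam - 1.0)

def tilted(q, p, lam):
    """Assumed form of R*: density w.r.t. p proportional to (q/p)^lam."""
    w = [x ** lam * y ** (1.0 - lam) for x, y in zip(q, p)]
    z = sum(w)
    return [v / z for v in w]

def random_pmf(rng, k):
    """Random strictly positive pmf on k points (illustrative data only)."""
    w = [rng.random() + 0.05 for _ in range(k)]
    z = sum(w)
    return [v / z for v in w]

rng = random.Random(1)
p = random_pmf(rng, 5)
q = random_pmf(rng, 5)
lam = 2.5  # any order lam > 1 works for the ordering (73)

r_star = tilted(q, p, lam)
# Left- and right-hand sides of (72).
lhs = kl(r_star, p) + lam / (1.0 - lam) * kl(r_star, q)
rhs = renyi(q, p, lam)

print("identity (72) holds:", abs(lhs - rhs) < 1e-9)
print("ordering (73) holds:", kl(q, p) <= rhs <= kl(r_star, p))
```

For $\lambda>1$ the coefficient $\lambda/(1-\lambda)$ is negative, which is why the Rényi divergence is sandwiched between $D(Q_X\|P_X)$ and $D(R_X^*\|P_X)$ in this check.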

1) Proof of Lemma 5:

Here we only prove the case $p,q,\hat{p},\hat{q}>0$. Other cases can be proven similarly.

For any reals $b_{i,j}$, $(i,j)\in[2]^2$, define
\[
\eta(R_{X_1},R_{Y_1},R_{X_2},R_{Y_2}):=\min_{i,j\in[2]}\left\{D(R_{X_i},R_{Y_j}\|P_{XY})+b_{i,j}\right\},
\]
and
\[
\hat{\eta}(R_{X_1},R_{Y_1},R_{X_2},R_{Y_2}):=\inf_{R_{X_1Y_1X_2Y_2}\in\mathcal{C}(R_{X_1},R_{Y_1},R_{X_2},R_{Y_2})}\ \min_{i,j\in[2]}\left\{D(R_{X_iY_j}\|P_{XY})+b_{i,j}\right\}. \tag{82}
\]

Lemma 6. For any distributions $R_{X_1},R_{Y_1},R_{X_2},R_{Y_2}$,
\[
\eta(R_{X_1},R_{Y_1},R_{X_2},R_{Y_2})=\hat{\eta}(R_{X_1},R_{Y_1},R_{X_2},R_{Y_2}).
\]

Proof: This lemma follows by swapping the two minimizations in (82).

By definition,
\begin{align}
\frac{1}{n}\Lambda^{**}_{p,q,\hat{p},\hat{q}}(n\alpha,n\beta|P_{XY}^n)
=\sup_{\substack{\alpha_\lambda^-\le s_1\le\alpha_\lambda^+,\ \beta_\mu^-\le t_1\le\beta_\mu^+,\\ \alpha_{1/\lambda}^-\le s_2\le\alpha_{1/\lambda}^+,\ \beta_{1/\mu}^-\le t_2\le\beta_{1/\mu}^+}}\
\sup_{\substack{R_{X_1^n},R_{Y_1^n},R_{X_2^n},R_{Y_2^n}:\\ \frac{1}{n}D(R_{X_i^n}\|P_X^n)=s_i,\ \frac{1}{n}D(R_{Y_j^n}\|P_Y^n)=t_j,\ \forall i,j\in[2]}}
f(R_{X_1^n},R_{Y_1^n},R_{X_2^n},R_{Y_2^n}), \tag{83}
\end{align}
where
\[
f(R_{X_1^n},R_{Y_1^n},R_{X_2^n},R_{Y_2^n}):=\min_{i,j\in[2]}\left\{\frac{1}{n}D(R_{X_i^n},R_{Y_j^n}\|P_{XY}^n)+a_{i,j}\right\}
\]
and
\begin{align*}
a_{1,1}&:=\frac{\alpha-s_1}{p}+\frac{\beta-t_1}{q}, & a_{1,2}&:=\frac{\alpha-s_1}{p}+\frac{\beta-t_2}{\hat{q}},\\
a_{2,1}&:=\frac{\alpha-s_2}{\hat{p}}+\frac{\beta-t_1}{q}, & a_{2,2}&:=\frac{\alpha-s_2}{\hat{p}}+\frac{\beta-t_2}{\hat{q}}.
\end{align*}
For ease of presentation, here we use $(R_{X_1},R_{Y_1},R_{X_2},R_{Y_2})$ to denote $(Q_X,Q_Y,R_X,R_Y)$, and use $(s_1,t_1,s_2,t_2)$ to denote $(s,t,\hat{s},\hat{t})$. By the lemma above, we can rewrite
\[
f(R_{X_1^n},R_{Y_1^n},R_{X_2^n},R_{Y_2^n})=\inf_{R_{X_1^nY_1^nX_2^nY_2^n}\in\mathcal{C}(R_{X_1^n},R_{Y_1^n},R_{X_2^n},R_{Y_2^n})}\ \min_{i,j\in[2]}\left\{\frac{1}{n}D(R_{X_i^nY_j^n}\|P_{XY}^n)+a_{i,j}\right\}.
\]
Denote $R_{IJ}$ as a distribution on $[2]^2$. Then we can rewrite
\begin{align*}
f(R_{X_1^n},R_{Y_1^n},R_{X_2^n},R_{Y_2^n})
&=\inf_{R_{X_1^nY_1^nX_2^nY_2^n}\in\mathcal{C}(R_{X_1^n},R_{Y_1^n},R_{X_2^n},R_{Y_2^n})}\ \min_{R_{IJ}}\left\{\frac{1}{n}D(R_{X_I^nY_J^n}\|P_{XY}^n|R_{IJ})+\mathbb{E}_R\,a_{I,J}\right\}\\
&=\min_{R_{IJ}}\left\{\inf_{R_{X_1^nY_1^nX_2^nY_2^n}\in\mathcal{C}(R_{X_1^n},R_{Y_1^n},R_{X_2^n},R_{Y_2^n})}\frac{1}{n}D(R_{X_I^nY_J^n}\|P_{XY}^n|R_{IJ})+\mathbb{E}_R\,a_{I,J}\right\}.
\end{align*}
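The rewriting above replaces a minimum over the four index pairs $(i,j)\in[2]^2$ by a minimum over distributions $R_{IJ}$ on $[2]^2$. For a fixed coupling, the objective is an average of the four per-pair costs and hence linear in $R_{IJ}$, so its minimum over the probability simplex is attained at a vertex (a point mass on the best pair). A minimal numerical sketch of this fact follows; the cost values `c[i][j]` are made-up stand-ins for $\frac{1}{n}D(\cdot\|\cdot)+a_{i,j}$ and the helper names are illustrative assumptions, not notation from the paper.

```python
import random

def min_over_pairs(c):
    """Minimum of c[i][j] over the four index pairs (i, j) in [2]^2."""
    return min(c[i][j] for i in range(2) for j in range(2))

def averaged_cost(r, c):
    """E_R[c_{I,J}] for a distribution r on the index pairs."""
    return sum(r[i][j] * c[i][j] for i in range(2) for j in range(2))

def random_distribution(rng):
    """Random point of the probability simplex on [2]^2."""
    w = [rng.random() + 1e-12 for _ in range(4)]
    total = sum(w)
    return [[w[0] / total, w[1] / total], [w[2] / total, w[3] / total]]

rng = random.Random(0)
# Hypothetical per-pair costs, chosen only for illustration.
c = [[0.7, 0.3], [1.2, 0.5]]

vertex_min = min_over_pairs(c)

# No distribution on the simplex does better than the best single pair ...
samples = [averaged_cost(random_distribution(rng), c) for _ in range(10000)]

# ... and the point mass on the minimizing pair attains that value.
pairs = [(i, j) for i in range(2) for j in range(2)]
i_star, j_star = min(pairs, key=lambda ij: c[ij[0]][ij[1]])
dirac = [[1.0 if (i, j) == (i_star, j_star) else 0.0 for j in range(2)]
         for i in range(2)]

print("min over pairs:", vertex_min)
print("smallest sampled average:", min(samples))
print("value at the point mass:", averaged_cost(dirac, c))
```

Passing to distributions $R_{IJ}$ is what later allows the minimax theorem to be applied, since the probability simplex is convex and compact whereas the index set $[2]^2$ is not convex.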

To single-letterize f (cid:0) R X n , R Y n , R X n , R Y n (cid:1) , we need the following “chain rule” for coupling sets. Lemma 7 (“Chain Rule” for Coupling Sets) . [31, Lemma 9] For a pair of conditional distributions ( P X n | W , P Y n | W ) ,we have n (cid:89) i =1 C ( P X i | X i − W , P Y i | Y i − W ) ⊆ C ( P X n | W , P Y n | W ) , where C ( P X i | X i − W , P Y i | Y i − W ) := (cid:8) Q X i Y i | X i − Y i − W : Q X i | X i − Y i − W = P X i | X i − W , Q Y i | X i − Y i − W = P Y i | Y i − W (cid:9) , i ∈ [ n ] and n (cid:89) i =1 C ( P X i | X i − W , P Y i | Y i − W ) := (cid:40) n (cid:89) i =1 Q X i Y i | X i − Y i − W : Q X i Y i | X i − Y i − W ∈ C ( P X i | X i − W , P Y i | Y i − W ) , i ∈ [ n ] (cid:41) . By this lemma, we obtain that inf R Xn Y n Xn Y n ∈ C ( R Xn ,R Y n ,R Xn ,R Y n ) D (cid:0) R X nI Y nJ (cid:107) P nXY | R IJ (cid:1) ≤ inf R Xn Y n Xn Y n ∈ (cid:81) nk =1 C ( R X ,k | Xk − ,R Y ,k | Y k − ,R X ,k | Xk − ,R Y ,k | Y k − ) D (cid:0) R X nI Y nJ (cid:107) P nXY | R IJ (cid:1) ≤ inf R X ,kY ,kX ,kY ,k | Xk − Y k − Xk − Y k − ∈ C ( R X ,k | Xk − ,R Y ,k | Y k − ,R X ,k | Xk − ,R Y ,k | Y k − ) ,k ∈ [ n ] n (cid:88) k =1 D k (84) = inf R X , Y , X , Y , ∈ C ( R X , ,R Y , ,R X , ,R Y , ) (cid:32) D + ... + inf R X ,n − Y ,n − X ,n − Y ,n − | Xn − Y n − Xn − Y n − ∈ C ( R X ,n − | Xn − ,R Y ,n − | Y n − ,R X ,n − | Xn − ,R Y ,n − | Y n − ) (cid:18) D n − + inf R X ,nY ,nX ,nY ,n | Xn − Y n − Xn − Y n − ∈ C ( R X ,n | Xn − ,R Y ,n | Y n − ,R X ,n | Xn − ,R Y ,n | Y n − ) D n (cid:19)(cid:33) ≤ inf R X , Y , X , Y , ∈ C ( R X , ,R Y , ,R X , ,R Y , ) (cid:32) D + ... 
+ sup R Xn − Y n − Xn − Y n − ∈ C ( R Xn − ,R Y n − ,R | Xn − ,R Y n − ) inf R X ,n − Y ,n − X ,n − Y ,n − | Xn − Y n − Xn − Y n − ∈ C ( R X ,n − | Xn − ,R Y ,n − | Y n − ,R X ,n − | Xn − ,R Y ,n − | Y n − ) (cid:18) D n − + sup R Xn − Y n − Xn − Y n − ∈ C ( R Xn − ,R Y n − ,R Xn − ,R Y n − ) inf R X ,nY ,nX ,nY ,n | Xn − Y n − Xn − Y n − ∈ C ( R X ,n | Xn − ,R Y ,n | Y n − ,R X ,n | Xn − ,R Y ,n | Y n − ) D n (cid:19)(cid:33) = n (cid:88) k =1 sup R Xk − Y k − Xk − Y k − ∈ C ( R Xk − ,R Y k − ,R Xk − ,R Y k − ) inf R X ,kY ,kX ,kY ,k | Xk − Y k − Xk − Y k − ∈ C ( R X ,k | Xk − ,R Y ,k | Y k − ,R X ,k | Xk − ,R Y ,k | Y k − ) D k = n (cid:88) k =1 sup R W | K = k ∈ C ( R U | K = k ,R V | K = k ,R U | K = k ,R V | K = k ) inf R X Y X Y | W,K = k ∈ C ( R X | U ,K = k ,R Y | V ,K = k ,R X | U ,K = k ,R Y | V ,K = k ) D (cid:0) R X I Y J | W,K = k (cid:107) P XY | R W | K = k R IJ (cid:1) (85) = sup R W | K ∈ C ( R U | K ,R V | K ,R U | K ,R V | K ) inf R X Y X Y | WK ∈ C ( R X | U K ,R Y | V K ,R X | U K ,R Y | V K ) n (cid:88) k =1 D (cid:0) R X I Y J | W,K = k (cid:107) P XY | R W | K = k R IJ (cid:1) (86) = n sup R W | K ∈ C ( R U | K ,R V | K ,R U | K ,R V | K ) (cid:101) D ( R W K , R IJ ) (87)where D k := D (cid:16) R X I,k Y J,k | X k − Y k − X k − Y k − (cid:107) P XY | R X k − Y k − X k − Y k − R IJ (cid:17) ; (84) follows from the chain ruleand the fact that conditioning increases the relative entropy; in (85), we denote K ∼ R K := Unif [ n ] as arandom time index which is assumed to be independent of all other random variables; also in (85), we denote X i := X i,K , Y j := Y j,K , U i := X K − i , V j := Y K − j , W := ( U , V , U , V ) ; in (86), we swap the summation withthe supremization and inﬁmization which is feasible since the supremization and inﬁmization is taken for each termin the summation independently; and in (87), (cid:101) D ( R W K , R IJ ):= (cid:101) D (cid:0) R X | U K , R Y | V K , R X | U K , R Y | V K | R W K , R IJ (cid:1) := inf R X 
Y X Y | WK ∈ C ( R X | U K ,R Y | V K ,R X | U K ,R Y | V K ) D (cid:0) R X I Y J | W K (cid:107) P XY | R W K R IJ (cid:1) = E R WK g ( W, K ) with g ( w, k ) := inf R X Y X Y | W = w,K = k ∈ C ( R X | U u ,K = k ,R Y | V v ,K = k ,R X | U u ,K = k ,R Y | V v ,K = k ) D (cid:0) R X I Y J | W = w,K = k (cid:107) P XY | R IJ (cid:1) and w = ( u , v , u , v ) .Therefore, n (cid:98) φ (cid:0) R X n , R Y n , R X n , R Y n (cid:1) = min R IJ sup R W | K ∈ C ( R U | K ,R V | K ,R U | K ,R V | K ) (cid:110)(cid:101) D ( R W K , R IJ ) − E R a I,J (cid:111) = sup R W | K ∈ C ( R U | K ,R V | K ,R U | K ,R V | K ) min R IJ (cid:110)(cid:101) D ( R W K , R IJ ) − E R a I,J (cid:111) (88)where in the last line, the minimization and the supremization are swapped. The swap here is feasible by the minimaxtheorem [32, Theorem 2.10.2], since 1) the space of R IJ is the probability simplex, which is obviously a nonemptyconvex compact subset of the locally convex space R ; 2) the coupling set C ( R U | K , R V | K , R U | K , R V | K ) is anonempty convex subset of a linear space (the nonnegative measure space); 3) the objective function ( R W K , R IJ ) (cid:55)→ (cid:101) D ( R W K , R IJ ) − E R a I,J is linear in R W K given R IJ , and convex and continuous in R IJ given R W K . The convexity of the objective functionin R IJ (given R W K ) follows by the convexity of the relative entropy. By the convexity, we have that the objectivefunction is upper semicontinuous in R IJ on the whole probability simplex, and moreover, for any A ⊆ [2] , theobjective function is continuous in R IJ on the set of R IJ such that R IJ ( i, j ) > , ∀ ( i, j ) ∈ A . Let R IJ be anarbitrary point in the probability simplex, and denote its support as supp ( R IJ ) . For a sequence of R ( m ) IJ converging to R IJ , we can deﬁne a new sequence of distributions (cid:101) R ( m ) IJ := R ( m ) IJ R ( m ) IJ (supp( R IJ )) which obviously converges to R IJ as well. 
Then, by the nonnegativity of the relative entropy, we have (cid:101) D (cid:16) R W K , R ( m ) IJ (cid:17) ≥ R ( m ) IJ (supp ( R IJ )) (cid:101) D (cid:16) R W K , (cid:101) R ( m ) IJ (cid:17) → (cid:101) D ( R W K , R IJ ) (89)where the last step follows since R ( m ) IJ (supp ( R IJ )) → (by the assumption R ( m ) IJ → R IJ ) and (cid:101) D ( R W K , Q IJ ) iscontinuous in Q IJ on the set of Q IJ such that supp ( Q IJ ) = supp ( R IJ ) (as shown above). Combining (89) withthe upper semicontinuty of the objective function in R IJ yields the continuity of the objective function in R IJ .Substituting (88) into (83), we obtain that n Λ ∗∗ p,q, (cid:98) p, (cid:98) q ( nα, nβ | P nXY ) ≤ sup α − λ ≤ s ≤ α + λ ,β − µ ≤ t ≤ β + µ ,α − /λ ≤ s ≤ α +1 /λ ,β − /µ ≤ t ≤ β +1 /µ sup R K ,R X U | K ,R Y V | K ,R X U | K ,R Y V | K : D ( R Xi | UiK (cid:107) P X | R Ui | K R K ) = s i ,D ( R Yj | VjK (cid:107) P Y | R Vj | K R K ) = t j , ∀ i,j ∈ [2] sup R W | K ∈ C ( R U | K ,R V | K ,R U | K ,R V | K ) min R IJ (cid:110)(cid:101) D (cid:0) R X | U K , R Y | V K , R X | U K , R Y | V K | R W K , R IJ (cid:1) + E R a I,J (cid:111) . Observe that the tuple consisting of all the distributions involved in the second and third supremizations above andthe joint distribution R K R W | K R X | U K R Y | V K R X | U K R Y | V K are mutually determined by each other. Hence,we can replace the second and third supremizations above with the supremization taken over the joint distributions R K R W | K R X | U K R Y | V K R X | U K R Y | V K (90)satisfying the constraints under the second supremization above.Denote (cid:99) W := ( W, K ) . Observe that for the joint distribution in (90), we have R X i | U i K = R X i | (cid:99) W and R Y j | V j K = R Y j | (cid:99) W . 
Hence, we further have n Λ ∗∗ p,q, (cid:98) p, (cid:98) q ( nα, nβ | P nXY ) ≤ sup α − λ ≤ s ≤ α + λ ,β − µ ≤ t ≤ β + µ ,α − /λ ≤ s ≤ α +1 /λ ,β − /µ ≤ t ≤ β +1 /µ sup R (cid:99) W R X | (cid:99) W R Y | (cid:99) W R X | (cid:99) W R Y | (cid:99) W : D ( R Xi | (cid:99) W (cid:107) P X | R (cid:99) W ) = s i ,D (cid:16) R Yj | (cid:99) W (cid:107) P Y | R (cid:99) W (cid:17) = t j , ∀ i,j ∈ [2] min R IJ (cid:110)(cid:101) D (cid:16) R X | (cid:99) W , R Y | (cid:99) W , R X | (cid:99) W , R Y | (cid:99) W | R (cid:99) W , R IJ (cid:17) + E R a I,J (cid:111) . (91)Observe that by deﬁnition, the inner minimization part above satisﬁes min R IJ (cid:110)(cid:101) D (cid:16) R X | (cid:99) W , R Y | (cid:99) W , R X | (cid:99) W , R Y | (cid:99) W | R (cid:99) W , R IJ (cid:17) − E a I,J (cid:111) = min R IJ  inf R X Y X Y | (cid:99) W ∈ C ( R X | (cid:99) W ,R Y | (cid:99) W ,R X | (cid:99) W ,R Y | (cid:99) W ) D (cid:16) R X I Y J | (cid:99) W (cid:107) P XY | R (cid:99) W R IJ (cid:17) + E R a I,J  = inf R X Y X Y | (cid:99) W ∈ C ( R X | (cid:99) W ,R Y | (cid:99) W ,R X | (cid:99) W ,R Y | (cid:99) W ) min R IJ (cid:110) D (cid:16) R X I Y J | (cid:99) W (cid:107) P XY | R (cid:99) W R IJ (cid:17) + E R a I,J (cid:111) = inf R X Y X Y | (cid:99) W ∈ C ( R X | (cid:99) W ,R Y | (cid:99) W ,R X | (cid:99) W ,R Y | (cid:99) W ) min i,j ∈ [2] (cid:110) D (cid:16) R X i Y j | (cid:99) W (cid:107) P XY | R (cid:99) W (cid:17) + a i,j (cid:111) = min i,j ∈ [2]  inf R X Y X Y | (cid:99) W ∈ C ( R X | (cid:99) W ,R Y | (cid:99) W ,R X | (cid:99) W ,R Y | (cid:99) W ) D (cid:16) R X i Y j | (cid:99) W (cid:107) P XY | R (cid:99) W (cid:17) + a i,j  = min i,j ∈ [2] (cid:110) D (cid:16) R X i | (cid:99) W , R Y j | (cid:99) W (cid:107) P XY | R (cid:99) W (cid:17) − a i,j (cid:111) . (92)From (92), we know that the RHS of (91) is equal to Λ ∗ p,q, (cid:98) p, (cid:98) q ( α, β | P XY ) . This completes the proof of Lemma 5. C. Sharpness

We next prove the asymptotic sharpness of the inequality in (8). The asymptotic sharpness of (9) follows similarly.Hence we omit the proof for the latter one.Let w be a sequence in W n with n -type T W . We choose f = (cid:88) T X | W e nµ TX | W T TX | W ( w ) (93) g = (cid:88) T Y | W e nν TY | W T TY | W ( w ) (94)where the summations are taken over all conditional n -types T X | W and T Y | W respectively. Denote s T X | W := D (cid:0) T X | W (cid:107) P X | T W (cid:1) . Then, n H p, (cid:98) p ( f ) = 1 n p (cid:98) pp − (cid:98) p log (cid:107) f ( X ) (cid:107) p (cid:107) f ( X ) (cid:107) (cid:98) p = 1 n (cid:98) pp − (cid:98) p log  (cid:88) T X | W P nX (cid:0) T T X | W ( w ) (cid:1) e npµ TX | W  − n pp − (cid:98) p log  (cid:88) T X | W P nX (cid:0) T T X | W ( w ) (cid:1) e n (cid:98) pµ TX | W  = (cid:98) pp − (cid:98) p max T X | W (cid:8) pµ T X | W − s T X | W (cid:9) − pp − (cid:98) p max T X | W (cid:8)(cid:98) pµ T X | W − s T X | W (cid:9) + o (1) . (95)Let (cid:110) T ∗ X | W (cid:111) n ≥ be a sequence of conditional types such that s T ∗ X | W = α + (cid:15) + o (1) for some constant (cid:15) > . Wechoose µ T X | W = max (cid:26) µ : (cid:98) pµ ≤ s T X | W , pµ ≤ s T X | W + (cid:18) p (cid:98) p − (cid:19) α (cid:27) , ∀ T X | W (cid:54) = T ∗ X | W . From (95), we know that n H p, (cid:98) p ( f ) = (cid:98) pp − (cid:98) p (cid:104) pµ T ∗ X | W − s T ∗ X | W (cid:105) + − pp − (cid:98) p (cid:104)(cid:98) pµ T ∗ X | W − s T ∗ X | W (cid:105) + + o (1)= (cid:98) pp − (cid:98) p (cid:104) pµ T ∗ X | W − ( α + (cid:15) ) (cid:105) + − pp − (cid:98) p (cid:104)(cid:98) pµ T ∗ X | W − ( α + (cid:15) ) (cid:105) + + o (1) . Note that t ∈ R ≥ (cid:55)→ (cid:98) pp − (cid:98) p [ pt − ( α + (cid:15) )] + − pp − (cid:98) p [ (cid:98) pt − ( α + (cid:15) )] + is continuous and could take any value on [0 , α + (cid:15) ] . Note that α ∈ [0 , α + (cid:15) ] . 
Hence, for sufﬁciently large n , thereexists a value µ T ∗ X | W such that n H p, (cid:98) p ( f ) = α. We choose µ T ∗ X | W as this value. For ν T Y | W in (94), we make asimilar choice. Therefore, − n log E [ f ( X ) g ( Y )] (cid:107) f ( X ) (cid:107) (cid:98) p (cid:107) g ( Y ) (cid:107) (cid:98) q + α (cid:98) p + β (cid:98) q ≤ − n log  (cid:88) T X | W (cid:54) = T ∗ X | W ,T Y | W (cid:54) = T ∗ Y | W P nXY (cid:0) T T X | W ( w ) × T T Y | W ( w ) (cid:1) e nµ TX | W + nν TY | W  + α (cid:98) p + β (cid:98) q ≤ min type Q XY W : Q W = T W D (cid:0) Q XY | W (cid:107) P XY | Q W (cid:1) + η p, (cid:98) p (cid:0) α, D (cid:0) Q X | W (cid:107) P X | Q W (cid:1)(cid:1) + η q, (cid:98) q (cid:0) β, D (cid:0) Q Y | W (cid:107) P Y | Q W (cid:1)(cid:1) . (96)We further minimize the RHS of (96) over all T W , then we have − n log E [ f ( X ) g ( Y )] (cid:107) f ( X ) (cid:107) (cid:98) p (cid:107) g ( Y ) (cid:107) (cid:98) q + α (cid:98) p + β (cid:98) q ≤ min type Q XY W D (cid:0) Q XY | W (cid:107) P XY | Q W (cid:1) + η p, (cid:98) p (cid:0) α, D (cid:0) Q X | W (cid:107) P X | Q W (cid:1)(cid:1) + η q, (cid:98) q (cid:0) β, D (cid:0) Q Y | W (cid:107) P Y | Q W (cid:1)(cid:1) = Λ ∗ p,q, (cid:98) p, (cid:98) q ( α, β ) + o (1) , where the last equality follows by the fact that the KL divergence D ( Q (cid:107) P ) is continuous in Q given P under thecondition Q (cid:28) P , and the fact that the set of types is dense in the probability simplex (i.e., the space of probabilitymeasures on the ﬁnite set W × X × Y ). Remark . 
It is worth nothing that indeed, for the case p, q, (cid:98) p, (cid:98) q > , choosing f, g as mixtures of the indicators oftwo (not necessarily all) conditional type classes is sufﬁcient to attain the asymptotic optimal value Λ ∗ p,q, (cid:98) p, (cid:98) q ( α, β ) .A PPENDIX CP ROOF OF T HEOREM (cid:107) (cid:98) g ( Y ) (cid:107) q =  sup g E [ (cid:98) g ( Y ) g ( Y )] (cid:107) g ( Y ) (cid:107) q (cid:48) q ≥ g E [ (cid:98) g ( Y ) g ( Y )] (cid:107) g ( Y ) (cid:107) q (cid:48) q < Setting (cid:98) g ( Y ) ← E [ f ( X ) | Y ] , we obtain the following equivalence: sup f ∈F α (cid:107) E [ f ( X ) | Y ] (cid:107) q (cid:107) f ( X ) (cid:107) (cid:98) p =  sup f ∈F α sup g E [ f ( X ) g ( Y )] (cid:107) f ( X ) (cid:107) (cid:98) p (cid:107) g ( Y ) (cid:107) q (cid:48) q ≥ f ∈F α inf g E [ f ( X ) g ( Y )] (cid:107) f ( X ) (cid:107) (cid:98) p (cid:107) g ( Y ) (cid:107) q (cid:48) q < , (97) inf f ∈F α (cid:107) E [ f ( X ) | Y ] (cid:107) q (cid:107) f ( X ) (cid:107) (cid:98) p =  inf f ∈F α sup g E [ f ( X ) g ( Y )] (cid:107) f ( X ) (cid:107) (cid:98) p (cid:107) g ( Y ) (cid:107) q (cid:48) q ≥ f ∈F α inf g E [ f ( X ) g ( Y )] (cid:107) f ( X ) (cid:107) (cid:98) p (cid:107) g ( Y ) (cid:107) q (cid:48) q < . (98)Hence, (37) for q ≥ and (38) for q < are implied by Theorem 1.We next consider the rest cases. To this end, we deﬁne the optimal maximin and minimax BL exponents respectively as ∆ (cid:101) p,q, (cid:98) p ( α | P XY ) := − sup f ∈F α inf g log E [ f ( X ) g ( Y )] (cid:107) f ( X ) (cid:107) (cid:98) p (cid:107) g ( Y ) (cid:107) q , (99) (cid:101) ∆ p,q, (cid:98) p ( α | P XY ) := − inf f ∈F α sup g log E [ f ( X ) g ( Y )] (cid:107) f ( X ) (cid:107) (cid:98) p (cid:107) g ( Y ) (cid:107) q . (100) Similar to Lemmas 1 and 2, we also have the following lemma.

Lemma 8.

Both Λ (cid:101) p,q, (cid:98) p ( α | P XY ) := ∆ (cid:101) p,q, (cid:98) p ( α | P XY ) + α (cid:98) p and (cid:101) Λ p,q, (cid:98) p ( α | P XY ) := (cid:101) ∆ p,q, (cid:98) p ( α | P XY ) + α (cid:98) p are symmetricw.r.t. ( p, (cid:98) p ) . We also have the following information-theoretic characterizations.

Proposition 2 (Information-Theoretic Characterizations) . Λ (cid:101) p,q, (cid:98) p ( α ) = inf Q X : D (cid:98) pp ( Q X (cid:107) P X )= α sup Q Y φ ( Q X , Q Y | P XY ) + αp , (101) (cid:101) Λ p,q, (cid:98) p ( α ) = sup Q X : D (cid:98) pp ( Q X (cid:107) P X )= α inf Q Y φ ( Q X , Q Y | P XY ) + αp , (102) where φ was deﬁned in (7) . We also have Λ (cid:101) p,q, (cid:98) p ( α ) = inf Q X : D (cid:98) pp ( Q X (cid:107) P X )= α − q (cid:48) log (cid:90) h q (cid:48) Y d P Y + αp for q < , (103) (cid:101) Λ p,q, (cid:98) p ( α ) = sup Q X : D (cid:98) pp ( Q X (cid:107) P X )= α − q (cid:48) log (cid:90) h q (cid:48) Y d P Y + αp for q ≥ , (104) where h Y ( y ) := (cid:82) (cid:16) d Q X d P X (cid:17) /p d P X | Y = y . In all these optimization problems, Q X (cid:28) P X if p, (cid:98) p > , otherwise, Q X (cid:28)(cid:29) P X ; and moreover, Q Y (cid:28) P Y .Proof: (101) and (102) follow similarly to Lemma 4. (103) and (104) follow from the equivalence in (97) for q < and the one in (98) for q ≥ .Deﬁne Λ (cid:101) ∗ p,q, (cid:98) p ( α ) := (cid:40) inf Q W sup Q Y inf Q X | W θ p,q, (cid:98) p (cid:0) Q W , Q X | W , Q Y (cid:1) q > Q W ,Q X | W ,Q Y θ p,q, (cid:98) p (cid:0) Q W , Q X | W , Q Y (cid:1) q < where θ p,q, (cid:98) p (cid:0) Q W , Q X | W , Q Y (cid:1) := D (cid:0) Q X | W , Q Y (cid:107) P XY | Q W (cid:1) − D ( Q Y (cid:107) P Y ) q + η p, (cid:98) p (cid:0) α, D (cid:0) Q X | W (cid:107) P X | Q W (cid:1)(cid:1) . By Carathéodory’s theorem, without loss of optimality, it sufﬁces to restrict |W| ≤ .According to the signs of p, (cid:98) p , we partition the vector (cid:110) Q X | W , (cid:98) Q X | W (cid:111) into two subsets S + , S − . Speciﬁcally, Q X | W ∈ S + if p ≥ ; Q X | W ∈ S − otherwise. Similarly, (cid:98) Q X | W is assigned into S + , S − according to the signs of (cid:98) p . 
Deﬁne (cid:101) Λ ∗ p,q, (cid:98) p ( α ) :=  sup Q W sup S + inf S − min (cid:26) θ p,q (cid:0) Q W , Q X | W (cid:1) , θ (cid:98) p,q (cid:16) Q W , (cid:98) Q X | W (cid:17)(cid:27) q > −∞ q < where θ p,q (cid:0) Q W , Q X | W (cid:1) := inf Q Y D (cid:0) Q X | W , Q Y (cid:107) P XY | Q W (cid:1) − D ( Q Y (cid:107) P Y ) q + α − D (cid:0) Q X | W (cid:107) P X | Q W (cid:1) p and the inﬁmization is taken over all distributions in S − , and the supremization is taken over all distributions in S + under the constraints α − (cid:98) p/p ≤ D (cid:0) Q X | W (cid:107) P X | Q W (cid:1) ≤ α + (cid:98) p/p and α − p/ (cid:98) p ≤ D (cid:16) (cid:98) Q X | W (cid:107) P X | Q W (cid:17) ≤ α + p/ (cid:98) p if p, (cid:98) p > .By Carathéodory’s theorem, without loss of optimality, it sufﬁces to restrict |W| ≤ .We provide dimension-free bounds for Λ (cid:101) ( n ) p,q, (cid:98) p and (cid:101) Λ ( n ) p,q, (cid:98) p in the following theorem. The proof is given in AppendixD. Theorem 8 (Maximin and Minimax BL Exponents) . For any n ≥ , p, q, (cid:98) p ∈ R \ { } , and α ∈ (cid:40) [0 , α max ] p (cid:98) p > −∞ , p (cid:98) p < ,we have (cid:101) Λ ( n ) p,q, (cid:98) p ( α ) ≤ (cid:101) Λ ∗ p,q, (cid:98) p ( α ) . (105) Additionally, if Y is ﬁnite for q > , then we also have Λ (cid:101) ( n ) p,q, (cid:98) p ( α ) ≥ Λ (cid:101) ∗ p,q, (cid:98) p ( α ) . (106) Moreover, for ﬁnite alphabets X , Y , (106) for the case q < and the inequality in (105) for the case q ≥ areasymptotically sharp as n → ∞ . By the equivalence in (97) and (98), (37) for q < and (38) for q ≥ are implied by Theorem 8.A PPENDIX DP ROOF OF T HEOREM p (cid:54) = (cid:98) p . The case of p = (cid:98) p follows similarly. A. Forward Case

Here we only prove the case p, (cid:98) p > . Other cases can be proven similarly.Deﬁne ψ ( R Y , Q X , Q Y ) := inf R X | Y D ( R XY (cid:107) P XY ) − p D ( R X (cid:107) P X ) − q D ( R Y (cid:107) P Y )+ 1 p D ( R X (cid:107) Q X ) + 1 q D ( R Y (cid:107) Q Y )= − (cid:90) log h Y d R Y + 1 q (cid:48) D ( R Y (cid:107) P Y ) + 1 q D ( R Y (cid:107) Q Y ) . (107)Then Λ (cid:101) p,q, (cid:98) p ( α ) = inf Q X : D (cid:98) pp ( Q X (cid:107) P X )= α sup Q Y inf R Y ψ ( R Y , Q X , Q Y ) + αp . (108)We ﬁrst consider the case < q < . By Lemma 4, we have sup Q Y ψ ( Q Y , Q X , Q Y ) = − q (cid:48) log (cid:90) h q (cid:48) Y d P Y . (109)On the other hand, by (103), we know that Λ (cid:101) p,q, (cid:98) p ( α ) = inf Q X : D (cid:98) pp ( Q X (cid:107) P X )= α − q (cid:48) log (cid:90) h q (cid:48) Y d P Y + αp . (110)Therefore, we have Λ (cid:101) p,q, (cid:98) p ( α ) = inf Q X : D (cid:98) pp ( Q X (cid:107) P X )= α sup Q Y ψ ( Q Y , Q X , Q Y ) + αp = inf Q X : D (cid:98) pp ( Q X (cid:107) P X )= α sup Q Y inf R X | Y D (cid:0) R X | Y Q Y (cid:107) P XY (cid:1) − q D ( Q Y (cid:107) P Y ) (111) − p D (cid:0) Q Y ◦ R X | Y (cid:107) P X (cid:1) + 1 p D (cid:0) Q Y ◦ R X | Y (cid:107) Q X (cid:1) + αp ≥ sup Q Y inf R X | Y D (cid:0) R X | Y Q Y (cid:107) P XY (cid:1) − q D ( Q Y (cid:107) P Y )+ η p, (cid:98) p (cid:0) α, D (cid:0) Q Y ◦ R X | Y (cid:107) P X (cid:1)(cid:1) , (112) The supremization is taken over Q Y such that − (cid:82) | log h Y | d Q Y , D ( Q Y (cid:107) P Y ) < ∞ . By Convention 1, we do not write out theseconstraints. where (112) follows by derivations similar to (61)-(64).Substituting ( α, β, P XY ) ← ( nα, nβ, P nXY ) , we obtain the n -dimensional version: Λ (cid:101) ( n ) p,q, (cid:98) p ( α ) ≥ n sup R Y n inf R Xn | Y n D ( R X n Y n (cid:107) P nXY ) − q D ( R Y n (cid:107) P nY )+ η p, (cid:98) p ( α, D ( R X n (cid:107) P nX )) , where for ease of presentation, we replace Q Y n with R Y n . 
By the chain rule, we have Λ (cid:101) ( n ) p,q, (cid:98) p ( α ) ≥ inf R UV sup R Y | V inf R X | Y UV D (cid:0) R XY | UV (cid:107) P XY | R UV (cid:1) − q D (cid:0) R Y | V (cid:107) P Y | R V (cid:1) + η p, (cid:98) p (cid:0) α, D (cid:0) R X | U (cid:107) P X | R U (cid:1)(cid:1) ≥ inf R UV sup R Y inf R X | Y UV D (cid:0) R XY | UV (cid:107) P XY | R UV (cid:1) − q D ( R Y (cid:107) P Y )+ η p, (cid:98) p (cid:0) α, D (cid:0) R X | UV (cid:107) P X | R UV (cid:1)(cid:1) = Λ (cid:101) ∗ p,q, (cid:98) p ( α ) . where X := X K , Y := Y K , U := (cid:0) X K − , K (cid:1) , V := (cid:0) Y K − , K (cid:1) , and K ∼ Unif [ n ] is independent of ( X n , Y n ) .The case q < follows similarly to the above. Hence, we omit the proof.We next consider the case of q > . By assumption, we restrict Y to be a ﬁnite set. For this case, we have Λ (cid:101) p,q, (cid:98) p ( α ) = inf Q X : D (cid:98) pp ( Q X (cid:107) P X )= α sup Q Y inf R Y ψ ( R Y , Q X , Q Y ) + αp ≥ inf Q X : D (cid:98) pp ( Q X (cid:107) P X )= α sup y ψ ( δ y , Q X , δ y ) + αp where the inequality follows since here we choose Q Y to be the Dirac measure δ y for an element y ∈ Y , whichforces R Y = δ y as well (otherwise the objective function would be inﬁnity). We can rewrite ψ ( δ y , Q X , δ y ) = g ( δ y ) , (113)where g ( R Y ) := inf R X | Y D (cid:0) R X | Y (cid:107) P X | Y | R Y (cid:1) − p (cid:90) log Q X P X d (cid:0) R X ◦ R X | Y (cid:1) + 1 q (cid:48) D ( R Y (cid:107) P Y )= − (cid:90) log h Y d R Y + 1 q (cid:48) D ( R Y (cid:107) P Y ) . We next claim that sup y g ( δ y ) = sup R Y g ( R Y ) . (114)Obviously, g is convex. On the other hand, the set of probability measures with ﬁnite supports is convex, and itsextreme points are the set of Dirac measures δ y for elements y ∈ Y . 
Hence, this claim holds.Substituting (113) and (114) into (55), we have Λ (cid:101) p,q, (cid:98) p ( α ) ≥ inf Q X : D (cid:98) pp ( Q X (cid:107) P X )= α sup R Y g ( R Y ) + αp ≥ sup Q Y inf R X D ( R X , Q Y (cid:107) P XY ) − q D ( Q Y (cid:107) P Y )+ η p, (cid:98) p ( α, D ( R X (cid:107) P X )) , (115)where (115) follows by (63). Substituting ( α, β, P XY ) ← ( nα, nβ, P nXY ) , we obtain the n dimensional version (aninequality concerning Λ (cid:101) ( n ) p,q, (cid:98) p ( α ) ). We then apply the single-letterization steps similar to (66)-(67), and obtain thedesired inequality Λ (cid:101) ( n ) p,q, (cid:98) p ( α ) ≥ Λ (cid:101) ∗ p,q, (cid:98) p ( α ) . B. Reverse Case

Here we only prove the case $p,\hat{p}>0$. Other cases can be proven similarly.

We first consider the case $n=1$. Define
\[
\psi(R_{XY},Q_X):=D(R_{XY}\|P_{XY})-\frac{1}{p}D(R_X\|P_X)-\frac{1}{q}D(R_Y\|P_Y)+\frac{1}{p}D(R_X\|Q_X)+\inf_{Q_Y}\frac{1}{q}D(R_Y\|Q_Y).
\]
Then,
\[
\widetilde{\Lambda}_{p,q,\hat{p}}(\alpha)=\sup_{Q_X:\,D_{\hat{p}/p}(Q_X\|P_X)=\alpha}\ \inf_{R_{XY}}\psi(R_{XY},Q_X)+\frac{\alpha}{p}. \tag{116}
\]
For $q < 0$, $\inf_{Q_Y}\frac{1}{q}D(R_Y\|Q_Y)=-\infty$. Hence
\[
\widetilde{\Lambda}_{p,q,\hat{p}}(\alpha)=-\infty. \tag{117}
\]
We next consider the case $q>0$. For this case, $\inf_{Q_Y}\frac{1}{q}D(R_Y\|Q_Y)=0$ and hence,
\[
\psi(R_{XY},Q_X)=D(R_{XY}\|P_{XY})-\frac{1}{p}D(R_X\|P_X)-\frac{1}{q}D(R_Y\|P_Y)+\frac{1}{p}D(R_X\|Q_X).
\]
Define $R_X^*$ as the distribution with the density given in (71). We assume $R_X^*$ exists; otherwise, define the approximate version in (78) and adopt a proof scheme similar to that in Appendix B-B. Denote $s=D(Q_X\|P_X)$, $\hat{s}=D(R_X^*\|P_X)$. Define
\[
\Theta_q(R_X):=\inf_{R_{Y|X}}D(R_{XY}\|P_{XY})-\frac{1}{q}D(R_Y\|P_Y).
\]
Then, we have
\begin{align}
\inf_{R_{XY}}\psi(R_{XY},Q_X)+\frac{\alpha}{p}
&\le\min_{R_X\in\{R_X^*,Q_X\}}\Theta_q(R_X)+\frac{1}{p}D(R_X\|Q_X)-\frac{1}{p}D(R_X\|P_X)+\frac{\alpha}{p} \notag\\
&=\min\left\{\Theta_q(R_X^*)+\frac{\alpha-\hat{s}}{\hat{p}},\ \Theta_q(Q_X)+\frac{\alpha-s}{p}\right\} \tag{118}\\
&=:\hat{\phi}(R_X^*,Q_X), \notag
\end{align}
where in (118), (72) was used to eliminate the term $D(R_X^*\|Q_X)$. Obviously, as in Appendix B-B, $\alpha_\lambda^-\le s\le\alpha_\lambda^+$ and $\alpha_{1/\lambda}^-\le\hat{s}\le\alpha_{1/\lambda}^+$ still hold. Then, relaxing $Q_X,R_X^*$ to be arbitrary distributions satisfying these two inequalities, we obtain
\[
\widetilde{\Lambda}_{p,q,\hat{p}}(\alpha)\le\sup_{\alpha_\lambda^-\le s\le\alpha_\lambda^+,\ \alpha_{1/\lambda}^-\le\hat{s}\le\alpha_{1/\lambda}^+}\ \sup_{Q_X,R_X:\,D(Q_X\|P_X)=s,\ D(R_X\|P_X)=\hat{s}}\hat{\phi}(R_X,Q_X). \tag{119}
\]
Lastly, substituting $(\alpha,\beta,P_{XY})\leftarrow(n\alpha,n\beta,P_{XY}^n)$, we obtain the $n$-dimensional version. By the single-letterization techniques used in the proof of Lemma 5, we obtain the desired result in (105). We omit the proof details here.

C. Sharpness

By the equivalence in (97) and (98), verifying the asymptotic sharpness of the inequality in (106) for the case q < and the inequality in (105) for the case q ≥ is equivalent to verifying the asymptotic sharpness of theinequality in (37) for q < and the inequality in (38) for q ≥ . The asymptotic sharpness for the latter twoinequalities can be veriﬁed by choosing f as the one in (93), but the parameters µ T X are needed to be reset. Weomit the proof, since it is similar to the one in Appendix B-C.A PPENDIX EP ROOF OF L EMMA (cid:107) f (cid:107) = 1 . Hence we can write f = d Q X d P X This choice implies that log (cid:107) f (cid:107) p = 1 p log (cid:90) (cid:18) d Q X d P X (cid:19) p d P X = 1 p (cid:48) D p ( Q X (cid:107) P X ) and log E [ E [ f ( X ) | Y ] q ] /q = 1 q log (cid:90) (cid:18)(cid:90) d Q X d P X d P X | Y (cid:19) q d P Y = 1 q log (cid:90) (cid:18) d Q Y d P Y (cid:19) q d P Y = 1 q (cid:48) D q ( Q Y (cid:107) P Y ) where Q Y := Q X ◦ P Y | X . Therefore, Λ p,q, ( α | P XY ) − αp = inf Q X : D p ( Q X (cid:107) P X )= α − log E [ E [ f ( X ) | Y ] q ] /q (cid:107) f (cid:107) p = inf Q X : D p ( Q X (cid:107) P X )= α − q (cid:48) D q ( Q Y (cid:107) P Y ) + 1 p (cid:48) D p ( Q X (cid:107) P X )= inf Q X : D p ( Q X (cid:107) P X )= α − q (cid:48) D q ( Q Y (cid:107) P Y ) + αp (cid:48) , (120)and similarly, Λ p,q, ( α | P XY ) − αp = sup Q X : D p ( Q X (cid:107) P X )= α − q (cid:48) D q ( Q Y (cid:107) P Y ) + αp (cid:48) . (121)The equalities (120) and (121) imply that η p → q (cid:0) α | P X , P Y | X (cid:1) = (cid:40) q (cid:48) (cid:0) α − Λ p,q, ( α | P XY ) (cid:1) q > or q < q (cid:48) (cid:0) α − Λ p,q, ( α | P XY ) (cid:1) < q < and η p → q (cid:0) α | P X , P Y | X (cid:1) = (cid:40) q (cid:48) (cid:0) α − Λ p,q, ( α | P XY ) (cid:1) q > or q < q (cid:48) (cid:0) α − Λ p,q, ( α | P XY ) (cid:1) < q < R EFERENCES [1] Y. Polyanskiy and A. Samorodnitsky. 
Improved log-Sobolev inequalities, hypercontractivity and uncertainty principle on the hypercube. Journal of Functional Analysis, 277(11):108280, 2019.
[2] N. Kirshner and A. Samorodnitsky. A moment ratio bound for polynomials and some extremal properties of Krawchouk polynomials and Hamming spheres. arXiv preprint arXiv:1909.11929, 2019.
[3] L. Yu, V. Anantharam, and J. Chen. Graphs of joint types, noninteractive simulation, and stronger hypercontractivity. arXiv preprint arXiv:2102.00668, Feb. 2021. [Online]. Available: https://arxiv.org/abs/2102.00668.
[4] H. J. Brascamp and E. H. Lieb. Best constants in Young's inequality, its converse, and its generalization to more than three functions. Advances in Mathematics, 20(2):151–173, 1976.
[5] F. Barthe. On a reverse form of the Brascamp–Lieb inequality. Inventiones mathematicae, 134(2):335–361, 1998.
[6] A. Bonami. Ensembles Λ(p) dans le dual de D^∞. In Annales de l'institut Fourier, volume 18, pages 193–204, 1968.
[7] K. Kiener. Über Produkte von quadratisch integrierbaren Funktionen endlicher Vielfalt. PhD thesis, Universität Innsbruck, 1969.
[8] M. Schreiber. Fermeture en probabilité de certains sous-espaces d'un espace L^2. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 14(1):36–48, 1969.
[9] A. Bonami. Étude des coefficients de Fourier des fonctions de L^p(G). In Annales de l'institut Fourier, volume 20, pages 335–402, 1970.
[10] L. Gross. Logarithmic Sobolev inequalities. American Journal of Mathematics, 97(4):1061–1083, 1975.
[11] R. Ahlswede and P. Gács. Spreading of sets in product spaces and hypercontraction of the Markov operator. The Annals of Probability, pages 925–939, 1976.
[12] C. Borell. Positivity improving operators and hypercontractivity. Mathematische Zeitschrift, 180(3):225–234, 1982.
[13] E. Mossel, K. Oleszkiewicz, and A. Sen. On reverse hypercontractivity. Geometric and Functional Analysis, 23(3):1062–1097, 2013.
[14] E. A. Carlen and D. Cordero-Erausquin. Subadditivity of the entropy and its relation to Brascamp–Lieb type inequalities. Geometric and Functional Analysis, 19(2):373–405, 2009.
[15] C. Nair. Equivalent formulations of hypercontractivity using information measures. In International Zurich Seminar, 2014.
[16] S. Beigi and C. Nair. Equivalent characterization of reverse Brascamp-Lieb-type inequalities using information measures. In , pages 1038–1042. IEEE, 2016.
[17] S. Kamath. Reverse hypercontractivity using information measures. In , pages 627–633. IEEE, 2015.
[18] J. Liu. Information theory from a functional viewpoint. Ph.D. Dissertation, Princeton, NJ: Princeton University, 2018.
[19] J. Bennett, A. Carbery, M. Christ, and T. Tao. The Brascamp–Lieb inequalities: finiteness, structure and extremals. Geometric and Functional Analysis, 17(5):1343–1415, 2008.
[20] T. A. Courtade and J. Liu. Euclidean forward–reverse Brascamp–Lieb inequalities: Finiteness, structure, and extremals. The Journal of Geometric Analysis, pages 1–51, 2020.
[21] A. D. Wyner and J. Ziv. A theorem on the entropy of certain binary sequences and applications: Part I. IEEE Trans. Inf. Theory, 19(6):769–772, 1973.
[22] H. Hsu, S. Asoodeh, S. Salamatian, and F. P. Calmon. Generalizing bottleneck problems. In , pages 531–535. IEEE, 2018.
[23] T. Van Erven and P. Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory, 60(7):3797–3820, 2014.
[24] V. Anantharam, A. Gohari, S. Kamath, and C. Nair. On hypercontractivity and a data processing inequality. In , pages 3022–3026. IEEE, 2014.
[25] A. Dembo and O. Zeitouni.

Large Deviations Techniques and Applications . Springer-Verlag, 2nd edition, 1998.[26] M. Raginsky. Logarithmic sobolev inequalities and strong data processing theorems for discrete channels. In , pages 419–423. IEEE, 2013.[27] R. O’Donnell.

Analysis of Boolean Functions . Cambridge University Press, 2014.[28] E. Mossel, R. O’Donnell, O. Regev, J. E. Steif, and B. Sudakov. Non-interactive correlation distillation, inhomogeneous Markov chains,and the reverse Bonami-Beckner inequality.

Israel Journal of Mathematics , 154(1):299–336, 2006.[29] O. Ordentlich, Y. Polyanskiy, and O. Shayevitz. A note on the probability of rectangles for correlated binary strings.

IEEE Trans. Inf.Theory , 2020.[30] J. Kahn, G. Kalai, and N. Linial. The inﬂuence of variables on Boolean functions. In , pages 68–80. IEEE, 1988.[31] L. Yu and V. Y. F. Tan. On exact and ∞ -Rényi common informations. IEEE Trans. Inf. Theory , 66(6):3366–3406, 2020.[32] C. Zalinescu.