Exact Error and Erasure Exponents for the Asymmetric Broadcast Channel
Daming Cao,
Student Member, IEEE
Vincent Y. F. Tan,
Senior Member, IEEE
Abstract—Consider the asymmetric broadcast channel with a random superposition codebook, which may be comprised of constant composition or i.i.d. codewords. By applying Forney's optimal decoder for individual messages and for the message pair at the receiver that decodes both messages, exact (ensemble-tight) error and erasure exponents are derived. It is shown that the optimal decoder designed to decode the pair of messages achieves the optimal trade-off between the total and undetected exponents associated with the optimal decoder for the private message. Convex optimization-based procedures to evaluate the exponents efficiently are proposed. Finally, numerical examples are presented to illustrate the results.
Index Terms—Broadcast channels, degraded message sets, erasure decoding, undetected error, error exponents, superposition coding.
I. INTRODUCTION
A. Background and Related Works
The broadcast channel [2] has been extensively studied in multi-user information theory. Although the capacity region is still unknown, some special cases have been solved. One example is the broadcast channel with degraded message sets, also known as the asymmetric broadcast channel (ABC). For this channel, one receiver desires to decode both the private message $m_1$ and the common message $m_2$, while the other receiver desires to decode only $m_2$. This model can be applied to a plethora of different scenarios; see Section I-D for concrete examples of broadcasting scenarios, taking into account the variation we consider herein.

The capacity region for the ABC was derived by Körner and Marton and is well known [3]. The earliest work on error exponents for the ABC is that by Körner and Sgarro [4], who used a constant composition ensemble for deriving an achievable error exponent. Later, Kaspi and Merhav [5] improved this work by deriving a tighter lower bound for the error exponent by analyzing the ensemble of i.i.d. random codes. Most recently, Averbuch et al. derived the exact random coding error exponents and expurgated exponents for the ensemble of constant composition codes in [6] and [7], respectively.

D. Cao is with the Southeast University of China (e-mail: [email protected]). V. Y. F. Tan is with the National University of Singapore (e-mail: [email protected]). D. Cao is supported by the China Scholarship Council (No. 201706090064) and the National Natural Science Foundation of China under Grant No. 61571122. V. Y. F. Tan is supported by a Singapore National Research Foundation (NRF) Fellowship (R-263-000-D02-281). This paper was presented in part at the 2018 IEEE International Symposium on Information Theory [1]. Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].
In this paper, we are interested in decoders with an erasure option. In this setting, the decoders may, instead of declaring that a particular message or set of messages is sent, output an erasure symbol. For the discrete memoryless channel (DMC), Forney [8] found the optimal decoder and derived a lower bound on the total and undetected error exponents using Gallager-style bounding techniques. Csiszár and Körner [9, Thm. 10.11] derived universally attainable erasure and error exponents using a generalization of the maximum mutual information (MMI) decoder. Telatar [10] also analyzed an erasure decoding rule with a general decoding metric. Moulin [11] generalized this family of decoders and proposed a new decoder parameterized by a weighting function. Merhav [12] derived lower bounds to these exponents by using a novel type-class enumerator method. In a breakthrough, Somekh-Baruch and Merhav [13] derived the exact random coding exponents for erasure decoding. Recently, Huleihel et al. [14] showed that the random coding exponent for erasure decoding is not universally achievable and established a simple relation between the total and undetected error exponents. Weinberger and Merhav [15] analyzed a simplified decoder for erasure decoding. Hayashi and Tan [16] derived ensemble-tight moderate deviations and second-order results for erasure decoding over additive DMCs. For the ABC, Tan [17] derived lower bounds on the total and undetected error exponents of an extended version of the universal decoder in Csiszár and Körner [9, Thm. 10.11]. Moreover, Merhav, in another landmark work [18], analyzed a random coding scheme with a binning (superposition coding) structure and showed that a potentially suboptimal bin index decoder achieves the random coding error exponent for decoding only the bin index.
B. Main Contributions
In this paper, we consider erasure decoding for the ABC with a superposition codebook structure, in which the distribution of the codewords is either i.i.d. or constant composition. For the decoder that aims to decode both messages, there are six exponents of interest: the total and undetected exponents corresponding to the individual messages $m_1$ and $m_2$ and to the pair of messages $(m_1, m_2)$. We derive exact (ensemble-tight) exponents for this problem. The main technical contribution to obtain the exact random coding exponents is a set of tools to handle statistical dependencies between codewords that share the same cloud center. To wit, Lemmas 6 and 7 consist of two technical results that serve to establish the equality between the total random coding error exponents pertaining to the first message (i.e., the private message $m_1$) and the message pair. This ameliorates the dependency problem, at least on the exponential scale, which is the asymptotic regime of interest. We show that the minimizations required to evaluate these error exponents can be cast as convex optimization problems, and thus can be solved efficiently using off-the-shelf convex optimization solvers such as CVX. As such, it is computationally tractable to compare the performance of practical codes to the information-theoretic limits presented here; this guides the design and analysis of future generations of codes. We present numerical examples to illustrate these exponents and the trade-offs involved in the erasure decoding problem for the ABC. We additionally show that the constant composition exponents are, in general, larger than the i.i.d. exponents.

[Fig. 1 (schematic plot of exponents versus the threshold $T$): For a fixed rate pair $(R_1, R_2)$ with varying threshold $T$, the figure schematically illustrates $E_{u1}$, $E_{t1}$, $E_{u2}$, $E_{t2}$, $E_u^Y$, and $E_t^Y$, which are, respectively, the undetected exponent for decoding $m_1$, the total exponent for decoding $m_1$, the undetected exponent for decoding $m_2$, the total exponent for decoding $m_2$, the undetected exponent for decoding $(m_1, m_2)$, and the total exponent for decoding $(m_1, m_2)$. Note that $E_{u1} = E_u^Y$ and $E_{t1} = E_t^Y$, so decoding $m_1$ optimally and decoding $(m_1, m_2)$ optimally result in the same undetected-total exponent trade-off. Clearly, the same is not true of optimal decoding of $m_2$ and optimal decoding of $(m_1, m_2)$.]

C. Motivation, Significance, Insights Gleaned, and a Surprise
Our motivation is to find exact (ensemble-tight) erasure and error exponents for the ABC and, from the resulting form of the exponents, to gain valuable insights into the various trade-offs that are present. In particular, we are interested in whether the optimal decoder for the pair of messages $(m_1, m_2)$ (at the receiver that is required to decode both messages) performs as well as that for decoding only the private message $m_1$ or, for that matter, the common message $m_2$. Our main observation is that the optimal decoder for $(m_1, m_2)$ achieves the optimal trade-off between the total and undetected exponents pertaining to $m_1$. What are the practical engineering implications and significance of this finding? In a broadcasting setting, the punchline of this paper says that if a communication engineer has the erasure option (e.g., in automatic repeat request/query (ARQ) [8] systems) and desires only to decipher the private message $m_1$, she can essentially obtain the other (common) message $m_2$ for free using a decoder designed to decode both $m_1$ and $m_2$. By "for free", we mean that the optimal trade-off in the total and undetected exponents for decoding $m_1$ (i.e., the performance of decoding $m_1$) is the same as that for $(m_1, m_2)$. In view of the packing lemma [19, Lemma 3.1] as applied to broadcast channels [19, Chapters 5 and 8], this observation might seem natural or unsurprising in the rate or capacity sense. However, what we show is much more: indeed, a refined asymptotic result. Our main observation and insight gleaned, which is surprising, implies that on the exponential scale (i.e., in terms of error and erasure exponents) there is no loss in the trade-off whether we choose to decode $m_1$ or $(m_1, m_2)$. On the other hand, if the engineer desires to decode only $m_2$, she needs to design a dedicated decoder for this task since the optimal trade-off in the total and undetected exponents for the joint decoder is, in general, worse than that of the dedicated one for $m_2$. This is illustrated schematically in Figure 1.

D. Practical, Real-Life Examples
Let us provide practical, real-life examples for which the above theoretical observation is applicable.

On Boxing Day in 2004, the massive Indian Ocean earthquake and tsunami struck. Its epicenter was off the west coast of northern Sumatra, Indonesia. This event resulted in a tremendous loss of lives (roughly a quarter million) and property (roughly worth USD $15 billion) to Indonesia, Sri Lanka, Myanmar, Thailand, the Maldives, and even countries as far as Somalia in East Africa, in which damage was present but markedly less severe. See Figure 2. Since then, tsunami warning systems have been set up in Indonesia among other countries. These warning systems (such as DART®, or Deep-ocean Assessment and Reporting of Tsunamis) are used to detect tsunamis in advance and to issue warnings to people that might be adversely affected; see [20], [21]. Often, various disparate pieces of information need to be disseminated or broadcast to common folk reliably. For example, those in the direct path of the tsunami may need to know $m_1$, the actions they should take to avoid loss of lives (e.g., move to higher ground), and $m_2$, the locations in which the tsunami will make landfall and the corresponding severities. For such countries, our result says that if the optimal decoder for $(m_1, m_2)$ is used, the performance, as defined in Section I-C, is the same as that for decoding only $m_1$. Hence, the take-home message is that the residents of Sumatra will, in addition to the actions they need to take, also know the locations at which the tsunami makes landfall. This can be done without any loss of optimality from the perspective of the trade-off between the error and erasure exponents. For countries that are far away from a major fault line, such as Somalia, perhaps the design of a decoder for only $m_2$ is needed since the presence of the tsunami in Southeast Asia is not likely to require any drastic action from Somalians, so information about $m_1$ is not needed there. In this case, the Somalian authorities and engineers need to design a dedicated decoder to ensure optimality of decoding $m_2$ with the erasure option. Note that since tsunami warning systems have the potential to save hundreds of thousands of lives, they have to be ultra reliable. As such, our error and erasure formulation, in which the undetected exponents are designed to be larger than their erasure counterparts (and hence the undetected error probability is exponentially smaller than its erasure counterpart), is of particular relevance in this critical setting.

[Fig. 2 (map with annotations "Needs to decode only the location of the epicenter, $m_2$" and "Needs to decode both the location and actions, $(m_1, m_2)$"): A map of countries affected by the 2004 Indian Ocean tsunami and its epicenter. In Sumatra, both $m_1$ and $m_2$ should be broadcast to the residents. Decoding both is just as reliable, in the sense of total and undetected exponents, as decoding only $m_1$. Thus, the main take-home message of this paper is that $m_2$ comes for free (i.e., $E_{t1} = E_t^Y$ and $E_{u1} = E_u^Y$ in Figure 1). In Somalia, though, $m_2$, the location of the tsunami's epicenter, is the more salient piece of information, and $m_1$ does not have to be known to the populace since they are unlikely to be required to take any substantial action. The performance of decoding $m_2$ alone is not the same as decoding $(m_1, m_2)$ and a dedicated decoder should be designed for the former (i.e., for Somalia). The figure, apart from annotations, is taken from https://en.wikipedia.org/wiki/2004_Indian_Ocean_earthquake_and_tsunami.]
In sum, the findings of our paper have the potential to guide the design and analysis of ultra-reliable infrastructure with varying demands, such as next-generation tsunami warning systems.

Another example comes from vehicle-to-vehicle (V2V) communications [22]. In these systems, vehicles form a communication network in which the vehicles themselves are communicating nodes. Through wireless transmissions, they provide each other with crucial information to enhance the safety of all vehicles involved and, in particular, to prevent accidents. For a concrete example, let us consider three vehicles $X$, $Y$ and $Z$; see Figure 3. $X$ is in close proximity to $Y$ and thus the two vehicles are likely to collide if no further action is taken. On the other hand, $Z$ is farther away from $X$ than $Y$ is. Hence, in this ultra-reliable setting, $X$ desires to transmit $m_1$, the course of actions $Y$ should take to avoid the crash, and $m_2$, its own location. The good news from our result says that using the optimal decoder, there is no loss in optimality in decoding both messages vis-à-vis only $m_1$. Since $Z$ does not need to take any actions at this point in time, it does not need to know $m_1$ and instead only needs to decode $m_2$. The optimal decoder for $Z$ needs to be designed differently from that for $m_1$ and $(m_1, m_2)$.

[Fig. 3 (network diagram): A V2V communication network [22] in which receivers have different demands; $Y$ needs $(m_1, m_2)$ while $Z$ needs only $m_2$.]

II. PROBLEM FORMULATION
A. Notation
Throughout this paper, random variables (RVs) will be denoted by upper case letters, their specific values will be denoted by the respective lower case letters, and their alphabets will be denoted by calligraphic letters. A similar convention will apply to random vectors of dimension $n \in \mathbb{N}$ and their realizations. For example, the random vector $X^n = (X_1, \ldots, X_n)$ may take on a certain realization $x^n = (x_1, \ldots, x_n)$ in $\mathcal{X}^n$, the $n$-th order Cartesian power of $\mathcal{X}$, which is the alphabet of each component of this vector. The distributions associated with random variables will be denoted by the letters $P$ or $Q$, with subscripts being the names of the random variables; e.g., $Q_{UXY}$ stands for a joint distribution of a triple of random variables $(U, X, Y)$ on $\mathcal{U} \times \mathcal{X} \times \mathcal{Y}$, the Cartesian product of the alphabets of $U$, $X$ and $Y$. In accordance with these notations, the joint distribution induced by $Q_Y$ and $Q_{X|Y}$ will be denoted by $Q_{XY} := Q_Y Q_{X|Y}$. Information measures induced by the joint distribution $Q_{XY}$ (or $Q$ for short) will be subscripted by $Q$. For example, $I_Q(X;Y)$ denotes the mutual information of the random variables $X$ and $Y$ with joint distribution $Q = Q_{XY}$.

For a sequence $x^n$, let $\hat{P}_{x^n}$ denote its empirical distribution or type. The type class $\mathcal{T}_{P_X}$ of $P_X$ is the set of all $x^n$ whose empirical distribution is $P_X$. For a given conditional probability distribution $P_{X|U}$ and sequence $u^n$, $\mathcal{T}_{P_{X|U}}(u^n)$ denotes the conditional type class ($P_{X|U}$-shell) given $u^n$, namely, the set of sequences $x^n$ whose joint empirical distribution with $u^n$ is given by $P_{X|U} \hat{P}_{u^n}$.

The probability of an event $\mathcal{E}$ will be denoted by $\Pr\{\mathcal{E}\}$, and the expectation operator with respect to a joint distribution $Q$ will be denoted by $\mathsf{E}_Q\{\cdot\}$. For two positive sequences $\{a_n\}$ and $\{b_n\}$, the notation $a_n \doteq b_n$ means that $\{a_n\}$ and $\{b_n\}$ are of the same exponential order, i.e., $\lim_{n\to\infty} \frac{1}{n} \ln \frac{a_n}{b_n} = 0$. Similarly, $a_n \,\dot{\le}\, b_n$ means that $\limsup_{n\to\infty} \frac{1}{n} \ln \frac{a_n}{b_n} \le 0$. The indicator function of an event $\mathcal{E}$ will be denoted by $\mathbb{1}\{\mathcal{E}\}$. The notation $|x|^+$ will stand for $\max\{x, 0\}$ and the notation $[M]$ stands for $\{1, \ldots, M\}$. Finally, logarithms and exponents will be understood to be taken to the natural base.
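As a concrete illustration of the type notation above, the short Python sketch below (our own illustration, not part of the paper; all function names are hypothetical) computes the empirical distribution of a sequence and the joint type of a pair of sequences:

```python
from collections import Counter

def empirical_type(x, alphabet):
    """Empirical distribution (type) of a sequence x^n over the given alphabet."""
    n = len(x)
    counts = Counter(x)
    return {a: counts[a] / n for a in alphabet}

def joint_type(u, x, U, X):
    """Joint empirical distribution of the pair of sequences (u^n, x^n)."""
    n = len(u)
    counts = Counter(zip(u, x))
    return {(a, b): counts[(a, b)] / n for a in U for b in X}

# The type of x^n = (0, 1, 1, 0, 1) over {0, 1} is (2/5, 3/5); x^n belongs to
# the type class of exactly this distribution.
print(empirical_type([0, 1, 1, 0, 1], [0, 1]))   # {0: 0.4, 1: 0.6}
print(joint_type([0, 0, 1], [0, 1, 1], [0, 1], [0, 1]))
```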
B. System Model

We consider a discrete memoryless ABC $W : \mathcal{X} \to \mathcal{Y} \times \mathcal{Z}$ with a finite input alphabet $\mathcal{X}$, finite output alphabets $\mathcal{Y}$ and $\mathcal{Z}$, and a transition probability matrix $\{W(y, z|x) : x \in \mathcal{X}, y \in \mathcal{Y}, z \in \mathcal{Z}\}$. Let $W_Y : \mathcal{X} \to \mathcal{Y}$ and $W_Z : \mathcal{X} \to \mathcal{Z}$ be respectively the $\mathcal{Y}$- and $\mathcal{Z}$-marginals of $W$.

Assume there is a random codebook $\mathcal{C}$ with superposition structure for this ABC, where the message pair $(m_1, m_2)$ is destined for user $\mathcal{Y}$ and the common message $m_2$ is destined for user $\mathcal{Z}$. In this paper, we consider i.i.d. random codes and constant composition random codes.

• For i.i.d. random codes, fix a distribution $P_{UX}(u, x)$ and randomly generate $M_2 = e^{nR_2}$ "cloud centers" $\{U^n(m_2) : m_2 \in \mathcal{M}_2 = [M_2]\}$ according to the distribution

  $P(u^n) := \prod_{i=1}^n P_U(u_i)$.  (1)

For each cloud center $U^n(m_2)$, randomly generate $M_1 = e^{nR_1}$ "satellite" codewords $\{X^n(m_1, m_2) : m_1 \in \mathcal{M}_1 = [M_1]\}$ according to the conditional probability distribution

  $P(x^n | u^n) := \prod_{i=1}^n P_{X|U}(x_i | u_i)$.  (2)

• For constant composition random codes, we fix a joint type $P_{UX}$ and randomly and independently generate $M_2 = e^{nR_2}$ "cloud centers" $\{U^n(m_2) : m_2 \in \mathcal{M}_2 = [M_2]\}$ under the uniform distribution on the type class $\mathcal{T}_{P_U}$. For each cloud center $U^n(m_2)$, we randomly and independently generate $M_1 = e^{nR_1}$ "satellite" codewords $\{X^n(m_1, m_2) : m_1 \in \mathcal{M}_1 = [M_1]\}$ under the uniform distribution on the conditional type class $\mathcal{T}_{P_{X|U}}(U^n(m_2))$.

The two decoders with erasure options are given by $g_1 : \mathcal{Y}^n \to (\mathcal{M}_1 \cup \{e\}) \times (\mathcal{M}_2 \cup \{e\})$ and $g_2 : \mathcal{Z}^n \to \mathcal{M}_2 \cup \{e\}$, where $e$ is the erasure symbol.
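The following Python sketch mirrors the i.i.d. codebook generation in (1)–(2). It is our own minimal illustration under placeholder alphabet sizes, blocklength and rates; the exponent computations in the paper operate on distributions rather than on sampled codebooks.

```python
import numpy as np

rng = np.random.default_rng(0)

def iid_superposition_codebook(P_U, P_XgU, n, R1, R2):
    """Generate an i.i.d. superposition codebook as in (1)-(2):
    M2 = e^{nR2} cloud centers U^n(m2) drawn i.i.d. from P_U, and for each
    cloud center, M1 = e^{nR1} satellites X^n(m1, m2) drawn from P_{X|U}."""
    M1 = int(np.ceil(np.exp(n * R1)))
    M2 = int(np.ceil(np.exp(n * R2)))
    clouds = rng.choice(len(P_U), size=(M2, n), p=P_U)      # U^n(m2)
    satellites = np.empty((M1, M2, n), dtype=int)           # X^n(m1, m2)
    for m2 in range(M2):
        for i in range(n):
            satellites[:, m2, i] = rng.choice(
                len(P_XgU[0]), size=M1, p=P_XgU[clouds[m2, i]])
    return clouds, satellites

# Toy instance: binary U and X, blocklength n = 8, rates in nats.
P_U = np.array([0.5, 0.5])
P_XgU = np.array([[0.95, 0.05], [0.05, 0.95]])   # rows: P_{X|U}(. | u)
clouds, satellites = iid_superposition_codebook(P_U, P_XgU, 8, 0.2, 0.1)
```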
C. Definitions of Error Probabilities and Error Exponents

In this paper, there are essentially twelve error probabilities under consideration: the error probabilities of decoding the pair of messages, the error probabilities of decoding the private message $m_1$ only, and the error probabilities of decoding the common message $m_2$ only. For each of these, there are the total and undetected error probabilities, and each can be computed at any of the terminals. We focus on six different error probabilities associated to terminal $\mathcal{Y}$. We do not derive the total and undetected error probabilities at terminal $\mathcal{Z}$ since the analysis is completely analogous to the analysis of the error and erasure probabilities of the "cloud centers" at terminal $\mathcal{Y}$ by replacing $W_Y$ with $W_Z$. However, we do compute these exponents numerically in Section V-C (see Figure 9). Define the disjoint decoding regions according to the decoder $g_1$ as $\mathcal{D}_{m_1 m_2} := \{y^n : g_1(y^n) = (m_1, m_2)\}$. Moreover, let $\{\mathcal{D}_{m_1} : m_1 \in \mathcal{M}_1\}$ and $\{\mathcal{D}_{m_2} : m_2 \in \mathcal{M}_2\}$ be the disjoint decoding regions associated to messages $m_1$ and $m_2$, respectively. For terminal $\mathcal{Y}$, define for message $m_j$, $j = 1, 2$, and the message pair $(m_1, m_2)$ the conditional total error and undetected error probabilities as

  $e_{t,j}(m_1, m_2) := W_Y^n\big(\mathcal{D}_{m_j}^c \,\big|\, x^n(m_1, m_2)\big)$  (3)
  $e_{u,j}(m_1, m_2) := W_Y^n\Big(\bigcup_{\hat{m}_j \in \mathcal{M}_j \setminus \{m_j\}} \mathcal{D}_{\hat{m}_j} \,\Big|\, x^n(m_1, m_2)\Big)$  (4)
  $e_t^Y(m_1, m_2) := W_Y^n\big(\mathcal{D}_{m_1, m_2}^c \,\big|\, x^n(m_1, m_2)\big)$  (5)
  $e_u^Y(m_1, m_2) := W_Y^n\Big(\bigcup_{(\hat{m}_1, \hat{m}_2) \ne (m_1, m_2)} \mathcal{D}_{\hat{m}_1 \hat{m}_2} \,\Big|\, x^n(m_1, m_2)\Big)$.  (6)

Then we may define the average total and undetected error probabilities at terminal $\mathcal{Y}$ as follows:

  $e_j^k := \frac{1}{M_1 M_2} \sum_{(m_1, m_2) \in \mathcal{M}_1 \times \mathcal{M}_2} e_j^k(m_1, m_2), \quad k \in \{t, u\}$  (7)
  $e_Y^k := \frac{1}{M_1 M_2} \sum_{(m_1, m_2) \in \mathcal{M}_1 \times \mathcal{M}_2} e_Y^k(m_1, m_2), \quad k \in \{t, u\}$.  (8)

Using the Neyman-Pearson theorem, Forney [8] obtained the optimal trade-off between the average total and undetected error probabilities for discrete memoryless channels. By following his idea and using a similar argument, we can show that the optimal trade-off between the average total and undetected error probabilities for the ABC is attained by the following decoding regions:

  $\mathcal{D}^*_{m_j} := \Big\{ y^n : \frac{\Pr(y^n | \mathcal{C}_j(m_j))}{\sum_{m_j' \ne m_j} \Pr(y^n | \mathcal{C}_j(m_j'))} \ge e^{nT} \Big\}$,  (9)
  $\mathcal{D}^*_{m_1 m_2} := \Big\{ y^n : \frac{W_Y^n(y^n | x^n(m_1, m_2))}{\sum_{(m_1', m_2') \ne (m_1, m_2)} W_Y^n(y^n | x^n(m_1', m_2'))} \ge e^{nT} \Big\}$,  (10)

where the distribution of the output $y^n$ conditioned on the sub-codebook $\mathcal{C}_1(m_1) = \{x^n(m_1, m_2) : m_2 \in \mathcal{M}_2\}$ is

  $\Pr(y^n | \mathcal{C}_1(m_1)) := \frac{1}{M_2} \sum_{m_2 \in \mathcal{M}_2} W_Y^n(y^n | x^n(m_1, m_2))$  (11)

and similarly for $\Pr(y^n | \mathcal{C}_2(m_2))$.

We would like to find the exact error exponents $E_{t,j}$, $E_{u,j}$, $E_t^Y$ and $E_u^Y$, $j = 1, 2$, with the erasure option, i.e., $T \ge 0$ (we do not consider the list decoding mode, i.e., $T < 0$, in this paper). These are the exponents associated to the expectation of the error probabilities, where the expectation is taken with respect to the randomness of the codebook $\mathcal{C}$, which possesses the superposition structure described in Section II-B. In other words,

  $E_{t1}(R_1, R_2, T) := \limsup_{n \to \infty} \Big[ -\frac{1}{n} \ln \mathsf{E}_{\mathcal{C}}[e_{t,1}] \Big]$,  (12)

and similarly for the other exponents $E_{u1}$, $E_t^Y$, $E_u^Y$, $E_{t2}$, and $E_{u2}$. We show, in fact, that the $\limsup$ in (12) is a limit. These exponents are also called random coding error exponents. If these exponents are known exactly, we say that ensemble-tight results are established.
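For intuition, here is a minimal Python sketch of the pair decoder $\mathcal{D}^*_{m_1 m_2}$ in (10). It is our own illustration (not the paper's code); `W_Y` is the single-letter channel matrix and `satellites` the codebook array from the previous sketch. For $T \ge 0$, at most one pair can pass the threshold test, so it suffices to test the maximum-likelihood pair.

```python
import numpy as np

def forney_pair_decoder(y, satellites, W_Y, T):
    """Threshold decoder of (10): output the pair (m1, m2) whose likelihood
    exceeds e^{nT} times the total likelihood of all other pairs; otherwise
    declare an erasure.  Messages are 0-indexed here (the paper uses 1)."""
    M1, M2, n = satellites.shape
    # log W_Y^n(y^n | x^n(m1, m2)) for every pair (m1, m2)
    loglik = np.array([[np.log(W_Y[satellites[m1, m2], y]).sum()
                        for m2 in range(M2)] for m1 in range(M1)])
    flat = loglik.ravel()
    best = int(np.argmax(flat))
    others = np.delete(flat, best)
    # log of the sum of the competitors' likelihoods (log-sum-exp trick)
    lse = others.max() + np.log(np.exp(others - others.max()).sum())
    if flat[best] - lse >= n * T:
        return np.unravel_index(best, (M1, M2))   # decoded pair (m1, m2)
    return "erasure"
```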
III. MAIN RESULTS AND DISCUSSIONS

The main results in this paper are stated below in Theorems 1 and 2, establishing exact random coding error exponents for the messages $m_j$, $j = 1, 2$, and the message pair at terminal $\mathcal{Y}$, i.e., the random coding exponents corresponding to the probabilities in (7)–(8).

Before stating our results, we state a few additional definitions. For a given probability distribution $Q = Q_{UXY}$ on $\mathcal{U} \times \mathcal{X} \times \mathcal{Y}$, rates $R_1$ and $R_2$, and the fixed random coding distribution $P = P_{UX}$, define

  $\beta(Q, R_1) := D(Q_{X|U} \| P_{X|U} | Q_U) + I_Q(X; Y | U) - R_1$  (13)
  $\gamma(Q, R_2) := D(Q_U \| P_U) + I_Q(U; Y) - R_2$  (14)
  $\Phi(Q, R_1, R_2) := \big| \gamma(Q, R_2) + |\beta(Q, R_1)|^+ \big|^+$  (15)
  $\Delta(Q, R_1, R_2) := \big| |-\gamma(Q, R_2)|^+ - \beta(Q, R_1) \big|^+$.  (16)

(Footnote: In the following, the threshold $T$ may take different values depending on whether we are decoding individual messages or the message pair.)
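The quantities (13)–(16) are straightforward to evaluate numerically. The following Python sketch is our own illustration; it assumes all distributions have full support so that the logarithms are finite.

```python
import numpy as np

def _plus(x):
    """|x|^+ = max{x, 0}."""
    return max(x, 0.0)

def beta(Q_UXY, P_XgU, R1):
    """beta(Q, R1) of (13): D(Q_{X|U} || P_{X|U} | Q_U) + I_Q(X;Y|U) - R1.
    Q_UXY has shape (|U|,|X|,|Y|); P_XgU has shape (|U|,|X|)."""
    Q_UX = Q_UXY.sum(axis=2)
    Q_UY = Q_UXY.sum(axis=1)
    Q_U = Q_UX.sum(axis=1)
    D = np.sum(Q_UX * np.log(Q_UX / (Q_U[:, None] * P_XgU)))
    I = np.sum(Q_UXY * np.log(Q_UXY * Q_U[:, None, None]
                              / (Q_UX[:, :, None] * Q_UY[:, None, :])))
    return D + I - R1

def gamma(Q_UXY, P_U, R2):
    """gamma(Q, R2) of (14): D(Q_U || P_U) + I_Q(U;Y) - R2."""
    Q_UY = Q_UXY.sum(axis=1)
    Q_U, Q_Y = Q_UY.sum(axis=1), Q_UY.sum(axis=0)
    D = np.sum(Q_U * np.log(Q_U / P_U))
    I = np.sum(Q_UY * np.log(Q_UY / np.outer(Q_U, Q_Y)))
    return D + I - R2

def Phi(Q_UXY, P_XgU, P_U, R1, R2):      # (15)
    return _plus(gamma(Q_UXY, P_U, R2) + _plus(beta(Q_UXY, P_XgU, R1)))

def Delta(Q_UXY, P_XgU, P_U, R1, R2):    # (16)
    return _plus(_plus(-gamma(Q_UXY, P_U, R2)) - beta(Q_UXY, P_XgU, R1))
```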
A. Main Results

Theorem 1.
For i.i.d. random codes, the error exponents $E_{t1}$, $E_{u1}$, $E_t^Y$ and $E_u^Y$ are given by

  $E_{t1}(R_1, R_2, T) = E_t^Y(R_1, R_2, T) = \min\{\Psi_a(R_1, R_2, T), \Psi_b(R_1, T)\}$  (17)
  $E_{u1}(R_1, R_2, T) = E_u^Y(R_1, R_2, T) = E_{t1}(R_1, R_2, T) + T$,  (18)

where

  $\Psi_a(R_1, R_2, T) := \min_{\hat{Q}_{UXY}} \Big[ D(\hat{Q}_{UXY} \| P_{UXY}) + \min_{Q_{UX|Y} \in \mathcal{L}_1(\hat{Q}_{XY}, R_1, R_2, T)} \Phi(Q_{UX|Y} \hat{Q}_Y, R_1, R_2) \Big]$  (19)
  $\Psi_b(R_1, T) := \min_{\hat{Q}_{UXY}} \Big[ D(\hat{Q}_{UXY} \| P_{UXY}) + \min_{Q_{X|UY} \in \mathcal{L}_2(\hat{Q}_{UXY}, R_1, T)} |\beta(Q_{X|UY} \hat{Q}_{UY}, R_1)|^+ \Big]$  (20)

with $P_{UXY}(u, x, y) := P_{UX}(u, x) W_Y(y|x)$, and the sets $\mathcal{L}_1$ and $\mathcal{L}_2$ are defined as

  $\mathcal{L}_1(\hat{Q}_{XY}, R_1, R_2, T) := \big\{ Q_{UX|Y} : \mathsf{E}_Q \ln \tfrac{1}{W_Y} + \mathsf{E}_{\hat{Q}} \ln W_Y - T \le \Delta(Q, R_1, R_2) \big\}$  (21)
  $\mathcal{L}_2(\hat{Q}_{UXY}, R_1, T) := \big\{ Q_{X|UY} : \mathsf{E}_Q \ln \tfrac{1}{W_Y} + \mathsf{E}_{\hat{Q}} \ln W_Y - T \le |-\beta(Q, R_1)|^+ \big\}$,  (22)

where $Q$ in (21) is equal to $Q = Q_{UX|Y} \hat{Q}_Y$, $Q$ in (22) is equal to $Q = Q_{X|UY} \hat{Q}_{UY}$, and the expectation $\mathsf{E}_{\hat{Q}} \ln W_Y$ can be explicitly written as $\sum_{u,x,y} \hat{Q}_{UXY}(u, x, y) \ln W_Y(y|x)$.

For constant composition random codes, the corresponding error exponents $E_{t1}$, $E_{u1}$, $E_t^Y$ and $E_u^Y$ can be obtained by adding additional constraints to the optimization problems that define the i.i.d. random coding error exponents above. In particular, all joint distributions $Q_{UXY}$ and $\hat{Q}_{UXY}$ that appear in (19)–(22) should satisfy the marginal constraint $Q_{UX} = P_{UX}$. For example, the corresponding exponent $\tilde{\Psi}_a$ for constant composition random codes is given by

  $\tilde{\Psi}_a(R_1, R_2, T) := \min_{\hat{Q}_{UXY} : \hat{Q}_{UX} = P_{UX}} \Big[ D(\hat{Q}_{UXY} \| P_{UXY}) + \min_{Q_{UX|Y} \in \tilde{\mathcal{L}}_1(\hat{Q}_{XY}, R_1, R_2, T)} \Phi(Q_{UX|Y} \hat{Q}_Y, R_1, R_2) \Big]$  (23)

and the set $\tilde{\mathcal{L}}_1$ is defined as

  $\tilde{\mathcal{L}}_1(\hat{Q}_{XY}, R_1, R_2, T) := \big\{ Q_{UX|Y} : Q_{UX} = P_{UX}, \; \mathsf{E}_Q \ln \tfrac{1}{W_Y} + \mathsf{E}_{\hat{Q}} \ln W_Y - T \le \Delta(Q, R_1, R_2) \big\}$,  (24)

where $Q$ in (24) is equal to $Q = Q_{UX|Y} \hat{Q}_Y$ and $Q_{UX}$ in (24) is the $(\mathcal{U} \times \mathcal{X})$-marginal distribution of $Q$.

(Footnote: In the following analyses and derivations, for ease of notation, we sometimes drop the dependencies of the error exponents (including those in Theorem 2) on the parameters $(R_1, R_2, T)$.)

The proof of Theorem 1 is provided in Section VI. It can be shown, by using Markov's inequality, that there exists a sequence of (deterministic) codebooks which can simultaneously achieve the exponents in Theorems 1 and 2 (cf. [16, Proof of Theorem 1]).
Theorem 2.
For i.i.d. random codes, the error exponents $E_{t2}$ and $E_{u2}$ are given by

  $E_{t2}(R_1, R_2, T) = \max\{\Psi_a(R_1, R_2, T), \Psi_c(R_1, R_2, T)\}$,  (25)
  $E_{u2}(R_1, R_2, T) = E_{t2}(R_1, R_2, T) + T$,  (26)

where

  $\Psi_c(R_1, R_2, T) := \min_{\hat{Q}_{UXY}} \Big[ D(\hat{Q}_{UXY} \| P_{UXY}) + \min_{Q_{UX|Y} \in \mathcal{L}_3(\hat{Q}_{UXY}, R_1, R_2, T)} \Phi(Q_{UX|Y} \hat{Q}_Y, R_1, R_2) \Big]$  (27)

with

  $\mathcal{L}_3(\hat{Q}_{UXY}, R_1, R_2, T) := \big\{ Q_{UX|Y} : \mathsf{E}_Q \ln \tfrac{1}{W_Y} + s(\hat{Q}_{UY}, R_1) - T \le \Delta(Q, R_1, R_2) \big\}$,  (28)

where $Q$ in (28) is equal to $Q = Q_{UX|Y} \hat{Q}_Y$, and

  $s(\hat{Q}_{UY}, R_1) := -\min_{\tilde{Q}_{X|UY} : \beta(\tilde{Q}, R_1) \le 0} \big[ \beta(\tilde{Q}, R_1) - \mathsf{E}_{\tilde{Q}} \ln W_Y \big]$,  (29)

and where $\tilde{Q}$ in (29) is equal to $\tilde{Q} = \tilde{Q}_{X|UY} \hat{Q}_{UY}$.

For constant composition random codes, the error exponents $E_{t2}$ and $E_{u2}$ are given by

  $E_{t2}(R_1, R_2, T) = \max\{\tilde{\Psi}_a(R_1, R_2, T), \tilde{\Psi}_c(R_1, R_2, T)\}$,  (30)
  $E_{u2}(R_1, R_2, T) = E_{t2}(R_1, R_2, T) + T$,  (31)

where

  $\tilde{\Psi}_c(R_1, R_2, T) := \min_{\hat{Q}_{UXY} : \hat{Q}_{UX} = P_{UX}} \Big[ D(\hat{Q}_{UXY} \| P_{UXY}) + \min_{Q_{UX|Y} \in \tilde{\mathcal{L}}_3(\hat{Q}_{UXY}, R_1, R_2, T)} \Phi(Q_{UX|Y} \hat{Q}_Y, R_1, R_2) \Big]$  (32)

with

  $\tilde{\mathcal{L}}_3(\hat{Q}_{UXY}, R_1, R_2, T) := \big\{ Q_{UX|Y} : Q_{UX} = P_{UX}, \; \mathsf{E}_Q \ln \tfrac{1}{W_Y} + \tilde{s}(\hat{Q}_{UY}, R_1) - T \le \Delta(Q, R_1, R_2) \big\}$,  (33)

where $Q$ in (33) is equal to $Q = Q_{UX|Y} \hat{Q}_Y$, $Q_{UX}$ in (33) is the $(\mathcal{U} \times \mathcal{X})$-marginal distribution of $Q$, and

  $\tilde{s}(\hat{Q}_{UY}, R_1) := -\min_{\tilde{Q}_{X|UY} : \beta(\tilde{Q}, R_1) \le 0, \, \tilde{Q}_{UX} = P_{UX}} \big[ \beta(\tilde{Q}, R_1) - \mathsf{E}_{\tilde{Q}} \ln W_Y \big]$,  (34)

and where $\tilde{Q}$ in (34) is equal to $\tilde{Q} = \tilde{Q}_{X|UY} \hat{Q}_{UY}$ and $\tilde{Q}_{UX}$ in (34) is the $(\mathcal{U} \times \mathcal{X})$-marginal distribution of $\tilde{Q}$.

The proof of Theorem 2 is provided in Section VII.
B. Discussion of Main Results
A few remarks on the theorems above are in order.

• Eqn. (17) in Theorem 1 implies that the optimal decoder for the pair of messages $(m_1, m_2)$ (i.e., $\mathcal{D}^*_{m_1 m_2}$ defined in (10)) achieves the optimal trade-off between the total and undetected error exponents pertaining to the private message $m_1$. This observation is non-trivial and not immediately obvious. When $\mathcal{Y}$ wishes to decode only the private message $m_1$, the optimal decoder for the pair of messages $(m_1, m_2)$, called the joint decoder, declares the message $\hat{m}_1$ of the decoded message pair $(\hat{m}_1, \hat{m}_2)$ as the final output. It is not clear that this decoding strategy is optimal error exponent-wise. The main difference between the error events for these two decoders is that the user $\mathcal{Y}$ can decode the correct private message $m_1$ but the wrong common message $m_2$. This is an error event for the joint decoder (but not for the one that focuses only on $m_1$). However, Lemma 7 implies that on the exponential scale, the exponents of the two decoders are the same, i.e., there is no loss in optimality in using the joint decoder for decoding only message $m_1$.

• One of our key technical contributions is Lemma 7 (to follow). This lemma allows us to simplify the calculation of the exponents by disentangling the statistical dependencies between "satellite codewords" that share the same cloud center. In particular, when we take into account the fact that the "cloud centers" $\mathcal{C}'_U$ (of which there are exponentially many) are random, this lemma allows us to decouple the dependence between two key random variables

  $F_1 = \sum_{m_1' \in \mathcal{M}_1 \setminus \{1\}} \sum_{m_2' \in \mathcal{M}_2 \setminus \{1\}} W_Y^n(y^n | X^n(m_1', m_2'))$  (35)

and

  $F_3 = \sum_{m_2' \in \mathcal{M}_2 \setminus \{1\}} W_Y^n(y^n | X^n(1, m_2'))$,  (36)

which are on different sides of a fundamental error probability (see (63) and (95) in the proof of Theorem 1 in Section VI). In contrast, for the analysis of the interference channel in [23] and [24], only an upper bound of the error probability is sought. This upper bound is not necessarily exponentially tight. On the other hand, the use of Lemma 7 incurs no loss in optimality on the exponential scale when appropriately combined with Lemma 6.

• In an elegant work [18], Merhav showed that for ordinary channel coding, independent random selection of codewords within a given type class, together with suboptimal bin index decoding (which is based on ordinary maximum likelihood decoding), performs as well as optimal bin index decoding in terms of the error exponent achieved. Furthermore, Merhav showed that for constant composition random codes with superposition coding and optimal decoding, the conclusion above no longer holds in general. In this paper, we show that for i.i.d. and constant composition random codes with superposition coding and erasure decoding, the conclusion holds for the case of decoding the "satellite" codewords. That is, the (in general) suboptimal decoding of the "satellite" codewords achieves the same random coding error exponent as the optimal decoding of the "satellite" codewords (see Theorem 1).

• In Theorem 1, the total error exponent for the private message $m_1$ is the minimum of two exponents $\Psi_a$ and $\Psi_b$. The first exponent $\Psi_a$ intuitively means that the user $\mathcal{Y}$ is in a regime where it decodes the pair of messages $(\hat{m}_1, \hat{m}_2)$. Loosely speaking, the second exponent $\Psi_b$ means that user $\mathcal{Y}$ knows the true common message $\tilde{m}_2$ (given by a genie), and then decodes the "satellite" codeword $X^n(m_1, \tilde{m}_2)$. In contrast to the single-user DMC case, now every codeword is generated according to a conditional probability distribution $P_{X|U}$.
Thus all codewords are conditioned on a particular $u^n(\tilde{m}_2)$ sequence rather than being generated according to a marginal distribution $P_X$. This is also reflected in the expression of the inner optimization in (20), which is averaged over the random variable $U$ (see the definition of $\beta(\cdot)$ in (13)).

• In this work, while it seems natural, we do not consider the list decoding mode in which $T < 0$, due to a couple of technical reasons. To ensure that $e^{-n(T + R_2)}$ in (88) vanishes, Lemma 6 holds on the condition that $T > -R_2$, rather than under the more general $T < 0$. Furthermore, Lemma 7, which is crucial in removing the dependence between the two key random variables $F_1$ and $F_3$, requires that $T \ge 0$ due to the derivation of (184). It appears to the authors that relaxations of the conditions on $T$ in Lemmas 6 and 7 would be rather involved, and so we defer the consideration of the list decoding mode to future work.

• It is clear from the closed-form expressions of the exponents in Theorems 1 and 2 that the constant composition ones are at least as large as their i.i.d. counterparts. In Section V-B, we present a numerical example to show that this inequality can be strict. Furthermore, if the broadcast channel is degraded in favor of $\mathcal{Y}$ (i.e., $X - Y - Z$ forms a Markov chain in this order), the error exponents at $\mathcal{Z}$ are smaller than those at $\mathcal{Y}$. We also verify this numerically in Section V-C.

• Finally, for the case in which user $\mathcal{Y}$ wishes to decode the common message $m_2$, the intuition gleaned from Theorem 2 is that if the decoding is not correct, both events $\{F_1' \ge f e^{-nT}\}$ and $\{F_1' \ge F_2 e^{-nT}\}$ should occur (see (140)). This also means that $\mathcal{Y}$ can take one of two actions. First, it can decode the true transmitted codeword $X^n(m_1, m_2)$ to identify $m_2$ when the complement of the first event (i.e., $\{F_1' \le f e^{-nT}\}$) occurs; this corresponds to the exponent $\Psi_a$. Second, it can decode the sub-codebook for the common message, $\mathcal{C}_2'(m_2) := \{X^n(m_1, m_2) : m_1 \in [M_1] \setminus \{1\}\}$, to identify $m_2$ when the complement of the second event (i.e., $\{F_1' \le F_2 e^{-nT}\}$) occurs; this corresponds to $\Psi_c$. This explains the maximization in the first expression in (25). When $R_1$ is large, the term $\Psi_c$ in (27) of Theorem 2 implies that $\mathcal{Y}$ is more likely than not to decode the "cloud center" $U^n(m_2)$ according to the "test channel" $W_{Y|U}(y|u) := \sum_x W_Y(y|x) P_{X|U}(x|u)$. This corresponds to the second decoding strategy, i.e., decoding the entire sub-codebook $\mathcal{C}_2'(m_2)$ indexed by $m_2$. Also see Remark 1 to follow.
IV. EVALUATING THE EXPONENTS VIA CONVEX OPTIMIZATION

In this section, we first consider i.i.d. random codes. To evaluate $E_{t1}$ in Theorem 1, we need to devise an efficient numerical procedure to solve the minimization problems $\Psi_a$ and $\Psi_b$. As will be shown below, these problems can be solved efficiently even though they are not convex.

For the second term $\Psi_b$ in (20), we can split the feasible region of the inner minimization, i.e., $\mathcal{L}_2(\hat{Q}_{UXY}, R_1, T)$ (see (22)), into two closed sets, namely $\mathcal{L}_4(\hat{Q}_{UXY}) := \mathcal{L}_2(\hat{Q}_{UXY}, R_1, T) \cap \mathcal{B}_1(\hat{Q}_{UY}, R_1)$ and $\mathcal{L}_5(\hat{Q}_{UXY}) := \mathcal{L}_2(\hat{Q}_{UXY}, R_1, T) \cap \mathcal{B}_2(\hat{Q}_{UY}, R_1)$, where

  $\mathcal{B}_1(\hat{Q}_{UY}, R_1) := \{ Q_{X|UY} : \beta(Q_{X|UY} \hat{Q}_{UY}, R_1) \ge 0 \}$  (37)
  $\mathcal{B}_2(\hat{Q}_{UY}, R_1) := \{ Q_{X|UY} : \beta(Q_{X|UY} \hat{Q}_{UY}, R_1) \le 0 \}$.  (38)

We denote the corresponding minimization problems pertaining to $\Psi_b$ in (20) (and (22)), in which the function $|\cdot|^+$ is inactive or active, as $\Psi_{b1}$ and $\Psi_{b2}$, respectively, i.e.,

  $\Psi_{b1} := \min_{\hat{Q}_{UXY}} \Big[ D(\hat{Q}_{UXY} \| P_{UXY}) + \min_{Q_{X|UY} \in \mathcal{L}_4(\hat{Q}_{UXY})} \beta(Q_{X|UY} \hat{Q}_{UY}, R_1) \Big]$  (39)
  $\Psi_{b2} := \min_{\hat{Q}_{UXY} : \mathcal{L}_5(\hat{Q}_{UXY}) \ne \emptyset} D(\hat{Q}_{UXY} \| P_{UXY})$,  (40)

where the sets $\mathcal{L}_4$ and $\mathcal{L}_5$ are defined as

  $\mathcal{L}_4(\hat{Q}_{UXY}) := \big\{ Q_{X|UY} : \mathsf{E}_Q \ln \tfrac{1}{W_Y} + \mathsf{E}_{\hat{Q}} \ln W_Y - T \le 0, \; \beta(Q, R_1) \ge 0 \big\}$,  (41)
  $\mathcal{L}_5(\hat{Q}_{UXY}) := \big\{ Q_{X|UY} : \mathsf{E}_Q \ln \tfrac{1}{W_Y} + \mathsf{E}_{\hat{Q}} \ln W_Y - T + \beta(Q, R_1) \le 0, \; \beta(Q, R_1) \le 0 \big\}$,  (42)

and where $Q$ in (41) and (42) is equal to $Q = Q_{X|UY} \hat{Q}_{UY}$. We thus have $\Psi_b = \min\{\Psi_{b1}, \Psi_{b2}\}$.

As the minimization problem $\Psi_{b2}$ is convex, it can be solved efficiently. However, $\Psi_{b1}$ is non-convex due to the non-convex constraint $\beta(Q_{X|UY} \hat{Q}_{UY}) \ge 0$ in the inner optimization. For the inner optimization, if we remove this constraint in $\mathcal{L}_4$, the modified problem is

  $\Psi_{b1}' := \min_{\hat{Q}_{UXY}} \Big[ D(\hat{Q}_{UXY} \| P_{UXY}) + \min_{Q_{X|UY} \in \mathcal{L}_4'(\hat{Q}_{UXY})} \beta(Q_{X|UY} \hat{Q}_{UY}) \Big]$,  (43)

where

  $\mathcal{L}_4'(\hat{Q}_{UXY}) := \big\{ Q_{X|UY} : \mathsf{E}_Q \ln \tfrac{1}{W_Y} + \mathsf{E}_{\hat{Q}} \ln W_Y - T \le 0 \big\}$,  (44)

which is convex and can be solved efficiently. Furthermore, we have the following proposition.

Proposition 3.
For the optimization problem $\Psi_{b1}$, if the optimal solution to the inner optimization of the modified problem $\Psi_{b1}'$ is not feasible for the original problem $\Psi_{b1}$, i.e., $\beta(Q_{X|UY} \hat{Q}_{UY}) < 0$, then there exists an optimal solution to the original inner optimization problem that satisfies $\beta(Q_{X|UY} \hat{Q}_{UY}) = 0$. Moreover, in this case, the optimal value of $\Psi_b$ is equal to that of $\Psi_{b2}$ (i.e., $\Psi_{b2}$ is active in the minimum that defines $\Psi_b$).

Proof of Proposition 3: See Appendix A.

In summary, we can solve the non-convex optimization problem $\Psi_b$ by solving two convex problems $\Psi_{b2}$ and $\Psi_{b1}'$, i.e.,

  $\Psi_b = \min\{(\Psi_{b1}')^+, \Psi_{b2}\}$,  (45)

where the superscript "$+$" of $(\Psi_{b1}')^+$ means that the value of $\Psi_{b1}'$ is active in the minimization only if the optimal solution $(\hat{Q}^*, Q^*)$ is also feasible for the original optimization $\Psi_{b1}$, i.e., $\beta(Q^*_{X|UY} \hat{Q}^*_{UY}) \ge 0$. In other words,

  $\Psi_b = \min\{\Psi_{b1}', \Psi_{b2}\}$ if $\beta(Q^*_{X|UY} \hat{Q}^*_{UY}) \ge 0$, and $\Psi_b = \Psi_{b2}$ otherwise.  (46)

Consequently, $\Psi_b$ can be solved efficiently.

For $\Psi_a$ in (19), let

  $\Omega(Q_{UX|Y} \hat{Q}_Y) := \mathsf{E}_{Q_{UX|Y} \hat{Q}_Y} \ln \tfrac{1}{W_Y} + \mathsf{E}_{\hat{Q}} \ln W_Y - T$;  (47)

then, similarly, we can partition the feasible region of the inner minimization into four parts and denote the corresponding inner optimization problems as follows:

1) If $\gamma(Q) \ge 0$ and $\beta(Q) \ge 0$, then

  $\Phi^*_{a1} := \min_Q \gamma(Q) + \beta(Q)$, such that $\Omega(Q) \le 0$.  (48)

(Footnote: In this section, we drop the dependences of $\beta(\cdot)$ and $\gamma(\cdot)$ on the rates $R_1$ and $R_2$.)
2) If $\gamma(Q) \ge 0$ and $\beta(Q) \le 0$, then

  $\Phi^*_{a2} := \min_Q \gamma(Q)$, such that $\Omega(Q) + \beta(Q) \le 0$.  (49)

3) If $\gamma(Q) \le 0$ and $\gamma(Q) + \beta(Q) \ge 0$, then

  $\Phi^*_{a3} := \min_Q \gamma(Q) + \beta(Q)$, such that $\Omega(Q) \le 0$.  (50)

4) If $\gamma(Q) \le 0$ and $\gamma(Q) + \beta(Q) \le 0$, then

  $\Phi^*_{a4} := 0$, such that $\Omega(Q) + \gamma(Q) + \beta(Q) \le 0$,  (51)

where $Q$ in the above definitions is equal to $Q = Q_{UX|Y} \hat{Q}_Y$ (compare the above to the definition of the optimization problem $\Psi_a$ in (19)). Thus we have

  $\Psi_a = \min_{\hat{Q}_{UXY}} \big[ D(\hat{Q}_{UXY} \| P_{UXY}) + \min_{i \in [4]} \{\Phi^*_{ai}(\hat{Q}_{XY})\} \big]$.  (52)

We can rewrite the objective functions of $\Phi^*_{a1}$ and $\Phi^*_{a3}$ as follows:

  $\min_{Q_{UX|Y}} \gamma(Q_{UX|Y} \hat{Q}_Y) + \beta(Q_{UX|Y} \hat{Q}_Y) = \min_{Q_{U|Y}} \big[ \gamma(Q_{U|Y} \hat{Q}_Y) + \min_{Q_{X|UY}} \beta(Q_{X|UY} Q_{U|Y} \hat{Q}_Y) \big]$,  (53)

where the notation $\gamma(Q_{U|Y} \hat{Q}_Y)$ is consistent due to the fact that the function $\gamma(Q, R_2)$ (see (14)) only depends on the marginal distribution $Q_{UY}$. Therefore, by using a similar argument as that for $\Psi_{b1}$ above, we can remove the non-convex constraint $\beta(Q) \ge 0$ in $\Phi^*_{a1}$ due to $\Phi^*_{a2}$. We can also remove the non-convex constraint $\gamma(Q) + \beta(Q) \ge 0$ in $\Phi^*_{a3}$ due to $\Phi^*_{a4}$. Denote these two modified optimizations as $\Phi'_{a1}$ and $\Phi'_{a3}$, respectively. We can merge these two modified optimizations $\Phi'_{a1}$ and $\Phi'_{a3}$ into a new convex optimization problem $\Phi^*_{a5}$, i.e.,

  $\Phi^*_{a5} := \min_Q \gamma(Q) + \beta(Q)$, such that $\Omega(Q) \le 0$.  (54)

We now state and prove a proposition that simplifies the calculation of (52).

Proposition 4.
For the inner minimization problem in (52), i.e., $\min_{i \in [4]} \{\Phi^*_{ai}(\hat{Q}_{XY})\}$, without loss of optimality, we can replace $\Phi^*_{a1}$ and $\Phi^*_{a3}$ with the new convex optimization problem $\Phi^*_{a5}$.

Proof of Proposition 4: See Appendix B.

For the second term $\Phi^*_{a2}$, we can also remove the non-convex constraint $\gamma(Q) \ge 0$ due to $\Phi^*_{a4}$. Therefore, we can solve the minimization problem $\Psi_a$ in (19) efficiently, as the remaining case $\Phi^*_{a4}$ is a convex minimization problem. Similarly to the above, we can also efficiently calculate $E_{t2}$ in Theorem 2, as $s(\hat{Q}, R_1)$ is defined by a convex minimization problem. Finally, for constant composition random codes, since the additional marginal constraints are linear, the transformed optimization problems remain convex and can be solved efficiently, as we show in Section V-B.
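To illustrate how such convex subproblems can be handed to an off-the-shelf solver, the following Python/CVXPY sketch (our own illustration; the paper's computations were done with CVX) solves the relaxed problem $\Psi_{b1}'$ of (43)–(44), using the identity $\beta(Q, R_1) = D(Q_{UXY} \,\|\, P_{X|U} \times Q_{UY}) - R_1$, which follows from (13). All shapes and the full-support assumption are ours.

```python
import numpy as np
import cvxpy as cp

def psi_b1_relaxed(P_UXY, P_XgU, W_Y, R1, T):
    """Sketch of the relaxed convex problem Psi'_b1 in (43)-(44).
    P_UXY: (nu,nx,ny) joint P_{UX} x W_Y; P_XgU: (nu,nx); W_Y: (nx,ny).
    Assumes all entries are strictly positive."""
    nu, nx, ny = P_UXY.shape
    K = nu * nx * ny
    idx = np.arange(K).reshape(nu, nx, ny)

    # A: vec(Q) -> vec(Q_UY) (marginalize over x); B: broadcast Q_UY over x.
    A = np.zeros((nu * ny, K))
    B = np.zeros((K, nu * ny))
    for u in range(nu):
        for x in range(nx):
            for y in range(ny):
                A[u * ny + y, idx[u, x, y]] = 1.0
                B[idx[u, x, y], u * ny + y] = 1.0

    p = P_UXY.reshape(K)                        # vec(P_UXY)
    pxgu = np.repeat(P_XgU.reshape(-1), ny)     # P_{X|U}(x|u) per (u,x,y)
    lw = np.tile(np.log(W_Y).reshape(-1), nu)   # ln W_Y(y|x) per (u,x,y)

    Qh = cp.Variable(K, nonneg=True)            # vec(Qhat_{UXY})
    Q = cp.Variable(K, nonneg=True)             # vec(Q_{UXY}) = Q_{X|UY} Qhat_{UY}

    # D(Qhat || P_UXY) + [ D(Q || P_{X|U} x Q_{UY}) - R1 ]   (= beta(Q, R1))
    obj = (cp.sum(cp.rel_entr(Qh, p))
           + cp.sum(cp.rel_entr(Q, cp.multiply(pxgu, B @ (A @ Q)))) - R1)
    cons = [cp.sum(Qh) == 1,
            A @ Q == A @ Qh,                    # marginal tie Q_{UY} = Qhat_{UY}
            -lw @ Q + lw @ Qh - T <= 0]         # constraint (44)
    prob = cp.Problem(cp.Minimize(obj), cons)
    prob.solve()
    return prob.value
```

Note that `rel_entr` is jointly convex in both arguments, so the second term remains a disciplined convex expression even though $Q_{UY}$ is itself an affine function of the variable $Q$.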
V. NUMERICAL EVALUATIONS
In this section, we present numerical examples to illustrate the following:
• the behavior of the i.i.d. exponents in Theorems 1 and 2;
• the comparison between the constant composition and the i.i.d. error exponents in Theorem 1;
• the comparison between the i.i.d. error exponents for message $m_2$ at terminals $\mathcal{Y}$ and $\mathcal{Z}$.

We consider binary symmetric channels (BSCs): $Y = X \oplus V_1$ and $Z = X \oplus V_2$, where $X, Y, V_1, V_2 \in \{0, 1\}$, $V_1 \sim \mathrm{Bern}(p_1)$ and $V_2 \sim \mathrm{Bern}(p_2)$. Let $U$ be binary as well and $U \sim \mathrm{Bern}(0.5)$. Also, let $X = U \oplus V_0$, where $V_0 \in \{0, 1\}$ and $V_0 \sim \mathrm{Bern}(q)$. In this example, we fix the values of $p_1$, $p_2$ and $q$, and all the rates are in nats. All the Matlab® ...
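A small Python sketch of this setup follows; the numerical values of $p_1$, $p_2$ and $q$ below are illustrative placeholders of our own rather than the values used in the paper's plots.

```python
import numpy as np

def bsc(p):
    """Transition matrix of a BSC with crossover probability p
    (rows indexed by the input, columns by the output)."""
    return np.array([[1.0 - p, p], [p, 1.0 - p]])

# Placeholder crossover probabilities (illustrative only).
p1, p2, q = 0.1, 0.2, 0.05
W_Y, W_Z = bsc(p1), bsc(p2)      # Y = X + V1, Z = X + V2 (mod 2)
P_U = np.array([0.5, 0.5])       # U ~ Bern(0.5)
P_XgU = bsc(q)                   # X = U + V0 with V0 ~ Bern(q)
P_UX = P_U[:, None] * P_XgU      # joint input distribution P_{UX}(u, x)
```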
A. Behavior of i.i.d. Exponents

Firstly, we consider the i.i.d. case in which $T = 0$. We obtain a three-dimensional exponent-rate region $(E_{tj}, R_1, R_2)$ for decoding $(m_1, m_2)$. To obtain a two-dimensional plot, we consider projections: fix one rate, vary the other rate, and plot the error exponent $E_{tj}$, $j = 1, 2$. Figure 4 shows one projection, for two fixed values of $R_2$ (in nats/channel use). For message $m_1$, the range of $R_1$ for which $E_{t1} > 0$ (a range that shrinks to $R_1 = 0$ for the larger value of $R_2$) coincides with that for the set of achievable rate pairs $(R_1, R_2)$ corresponding to our choice of input distribution $P_{UX}$ for decoding only message $m_1$, namely

  $\{R_1 \le I(X; Y | U)\} \cap \{R_1 + R_2 \le I(X; Y)\}$.  (55)

Moreover, we see that $E_{t1}$ for a fixed $R_2$ is horizontal for $R_1$ below a critical value and curved for $R_1$ above this value. For message $m_2$, the range of $R_1$ for which $E_{t2} > 0$ coincides with that for the set of achievable rate pairs $(R_1, R_2)$ corresponding to our choice of input distribution $P_{UX}$ for decoding only $m_2$, i.e.,

  $\{R_2 \le I(U; Y)\} \cup \{R_1 + R_2 \le I(X; Y)\}$.  (56)

(Footnote: The rate regions in (55) and (56) can be obtained by applying the packing lemma in [19, Lemma 3]. Also see [19, Sec. 5.3.1] for a similar analysis of the superposition coding inner bound.)

Figure 5 shows the other projection, for two fixed values of $R_1$ (in nats/channel use). It can also be checked that the range of $R_2$ for both messages $(m_1, m_2)$ coincides with (55). When $R_2 \le I(U; Y)$, we see that the curve of $E_{t2}$ rapidly decreases for $R_2$ below a critical value and remains horizontal for $R_2$ above the critical value. This is because when $R_2 \le I(U; Y)$, the rate pair $(R_1, R_2)$ is always achievable, i.e., $(R_1, R_2)$ belongs to the region defined in (56). When $R_2 \ge I(U; Y)$, we observe that the two error exponents $E_{t1}$ and $E_{t2}$ are equal.

Figures 6 and 7 illustrate the optimal trade-off between the i.i.d. total error exponent and the i.i.d. undetected error exponent as functions of $T$ for two different pairs of $(R_1, R_2)$. We observe that for both messages, the total error exponent decreases and the undetected error exponent increases as the threshold $T$ increases. We also observe that the smallest threshold $T$ for which the total error exponent is zero depends on the rate pair $(R_1, R_2)$ and decreases as either rate increases.

[Fig. 4: Total error exponents $E_{t1}$ and $E_{t2}$ as a function of $R_1$ for two different values of $R_2$ and where the threshold $T = 0$.]
[Fig. 5: Total error exponents $E_{t1}$ and $E_{t2}$ as a function of $R_2$ for two different values of $R_1$ and where the threshold $T = 0$.]
[Fig. 6: Total error exponent $E_{t1}$ and undetected error exponent $E_{u1}$ for message $m_1$ as a function of $T$ for two different pairs of $(R_1, R_2)$.]
[Fig. 7: Total error exponent $E_{t2}$ and undetected error exponent $E_{u2}$ for message $m_2$ as a function of $T$ for two different pairs of $(R_1, R_2)$.]
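The mutual informations that delimit the regions (55) and (56) can be computed directly from $P_{UX}$ and $W_Y$. A short Python sketch of our own (full support assumed):

```python
import numpy as np

def mutual_infos(P_UX, W_Y):
    """Return (I(U;Y), I(X;Y|U), I(X;Y)) in nats for P_{UXY} = P_{UX} W_Y."""
    P_UXY = P_UX[:, :, None] * W_Y[None, :, :]
    P_UY, P_XY = P_UXY.sum(axis=1), P_UXY.sum(axis=0)
    P_U, P_X = P_UX.sum(axis=1), P_UX.sum(axis=0)
    P_Y = P_UY.sum(axis=0)
    I_UY = np.sum(P_UY * np.log(P_UY / np.outer(P_U, P_Y)))
    I_XY = np.sum(P_XY * np.log(P_XY / np.outer(P_X, P_Y)))
    I_XYgU = np.sum(P_UXY * np.log(P_UXY * P_U[:, None, None]
                                   / (P_UX[:, :, None] * P_UY[:, None, :])))
    return I_UY, I_XYgU, I_XY

# With the BSC example above, a rate pair (R1, R2) supports decoding m1 at
# terminal Y iff R1 <= I(X;Y|U) and R1 + R2 <= I(X;Y); cf. (55).
```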
B. Gain of Constant Composition Exponents over i.i.d. Ones

We now demonstrate the gain of the constant composition exponents over the i.i.d. ones in Theorem 1. Denote the constant composition and i.i.d. total error exponents for $m_1$ as $\tilde{E}_{t1}$ and $E_{t1}$, respectively. For the example of BSCs described at the start of this section, Figure 8 displays these exponents as functions of $T$ for two different pairs of $(R_1, R_2)$. We observe that the constant composition exponents are strictly larger than their i.i.d. counterparts.

[Fig. 8: Constant composition and i.i.d. total error exponents $\tilde{E}_{t1}$ and $E_{t1}$ for message $m_1$ as a function of $T$ for two different pairs of $(R_1, R_2)$.]

C. Comparison of i.i.d. Exponents at Two Terminals
Finally, we consider the relationship between the exponents at the two terminals. We denote the i.i.d. total and undetected error exponents for $m_2$ at terminal $\mathcal{Z}$ as $E_{t2,Z}$ and $E_{u2,Z}$, respectively. Figure 9 compares the optimal trade-off between these exponents as functions of $T$ for terminals $\mathcal{Y}$ and $\mathcal{Z}$. We observe that, similarly to standard decoding, a worse channel quality leads to a smaller exponent for decoding with the erasure option (and vice versa).

[Fig. 9: Total and undetected error exponents $E_{u2}$ and $E_{t2}$ at terminal $\mathcal{Y}$ and $E_{u2,Z}$ and $E_{t2,Z}$ at terminal $\mathcal{Z}$ for message $m_2$ as a function of $T$ for a given rate pair $(R_1, R_2)$.]

VI. PROOF OF THEOREM 1
Proof of Theorem 1: Firstly, we consider i.i.d. random codes. At the end of the proof, we describe how to extend the analysis to constant composition codes. Assume, without loss of generality, that the true transmitted message pair is $(m_1, m_2) = (1, 1)$. Denote the random sub-codebook $\{U^n(m_2') : m_2' \in \mathcal{M}_2 \setminus \{1\}\}$ as $\mathcal{C}'_U$, and the (total) error event $\mathcal{E}_1$ as

  $\mathcal{E}_1 := \Big\{ \sum_{m_1' \ne 1} \Pr(Y^n | \mathcal{C}_1(m_1')) > \Pr(Y^n | \mathcal{C}_1(1)) \, e^{-nT} \Big\}$.  (57)

Given the optimal decoding region $\mathcal{D}^*_{m_1}$ in (9), by using the law of total probability, the average total error probability for message $m_1 = 1$ is

  $\mathsf{E}_{\mathcal{C}}[e_{t,1}(1, 1)] = \mathsf{E}_{(U^n(1), X^n(1,1), Y^n)} \big[ \mathsf{E}_{\mathcal{C}'_U} \big[ \Pr\{\mathcal{E}_1 \,|\, (U^n(1), X^n(1,1), Y^n), \mathcal{C}'_U\} \big] \big]$.  (58)

Next, we calculate the error probability given $(U^n(1), X^n(1,1), Y^n) = (u^n, x^n, y^n)$ with joint type $\hat{Q}_{UXY}$ and the sub-codebook $\mathcal{C}'_U = c'_U$. For brevity, define the quantities

  $F_1 := \sum_{m_1' \in \mathcal{M}_1 \setminus \{1\}} \sum_{m_2' \in \mathcal{M}_2 \setminus \{1\}} W_Y^n(y^n | X^n(m_1', m_2'))$  (59)
  $F_2 := \sum_{m_1' \in \mathcal{M}_1 \setminus \{1\}} W_Y^n(y^n | X^n(m_1', 1))$  (60)
  $f := W_Y^n(y^n | x^n)$  (61)
  $F_3 := \sum_{m_2' \in \mathcal{M}_2 \setminus \{1\}} W_Y^n(y^n | X^n(1, m_2'))$.  (62)

Note that $f$ is a deterministic quantity given $(X^n(1,1), Y^n) = (x^n, y^n)$, while the others are random. These definitions allow us to express $\Pr\{\mathcal{E}_1\}$ compactly as

  $\Pr\{\mathcal{E}_1\} = \Pr\{F_1 + F_2 > (f + F_3) \cdot e^{-nT}\}$.  (63)

Let $Q_{U|Y}$, $Q_{UX|Y}$ and $Q_{X|UY}$ be conditional types such that $Q_{U|Y} \hat{Q}_Y$, $Q_{UX|Y} \hat{Q}_Y$ and $Q_{X|UY} \hat{Q}_{UY}$ are joint types defined on $\mathcal{U} \times \mathcal{Y}$, $\mathcal{U} \times \mathcal{X} \times \mathcal{Y}$ and $\mathcal{U} \times \mathcal{X} \times \mathcal{Y}$, respectively. Define the following quantities:

  $\Lambda(Q_{UY}, \mathcal{C}'_U) := \big| \{ U^n(m_2) : m_2 \in \mathcal{M}_2 \setminus \{1\}, \; (U^n(m_2), y^n) \in \mathcal{T}_{Q_{UY}} \} \big|$,  (64)
  $N_{m_2}(Q_{UXY}) := \big| \{ X^n(m_1, m_2) : m_1 \in \mathcal{M}_1 \setminus \{1\}, \; (u^n(m_2), X^n(m_1, m_2), y^n) \in \mathcal{T}_{Q_{UXY}} \} \big|$ for all $m_2 \in \mathcal{M}_2$,  (65)
  $N_1(Q_{UXY}) := \big| \{ X^n(m_1, 1) : m_1 \in \mathcal{M}_1 \setminus \{1\}, \; (u^n, X^n(m_1, 1), y^n) \in \mathcal{T}_{Q_{UXY}} \} \big|$,  (66)

which represent the number of codewords $X^n(m_1, m_2)$ (resp. cloud centers $U^n(m_2)$) whose joint types with the corresponding "cloud centers" $u^n(m_2)$ and the received sequence $y^n$ (resp. with only the received sequence $y^n$) are $Q_{UXY}$ (resp. $Q_{UY}$). Note that $\Lambda(Q_{U|Y} \hat{Q}_Y, c'_U)$ is a deterministic quantity given $Y^n = y^n$ and a fixed $\mathcal{C}'_U = c'_U$. However, if we take into account the fact that $\mathcal{C}'_U$ is a collection of random variables, then $\Lambda(Q_{UY}, \mathcal{C}'_U)$ is a random variable given $Y^n = y^n$.

Now, recall the i.i.d. and constant composition random codebook generation procedures (see Section II-B) and the definitions of $\beta(Q_{UXY}, R_1)$ and $\gamma(Q_{UXY}, R_2)$ (see (13) and (14)). Then $\Lambda(Q_{U|Y} \hat{Q}_Y, \mathcal{C}'_U)$, $N_i(Q_{UX|Y} \hat{Q}_Y)$, $i \in \mathcal{M}_2$, and $N_1(Q_{X|UY} \hat{Q}_{UY})$ possess the following properties.

(Footnote: In the following analysis, for ease of notation, we drop the conditioning events $\{(U^n(1), X^n(1,1), Y^n) = (u^n, x^n, y^n)\}$ and $\{\mathcal{C}'_U = c'_U\}$ when there is no possibility of confusion.)

Fact 1.
1) For a given $Q_{U|Y}$, $\Lambda(Q_{U|Y} \hat{Q}_Y, \mathcal{C}'_U)$ is a binomial random variable with $(e^{nR_2} - 1)$ trials and "success" probability

  $\frac{|\mathcal{T}_{Q_{U|Y}}(y^n)|}{|\mathcal{T}_{Q_U}|} \cdot P_{U^n}(\mathcal{T}_{Q_U}) \doteq e^{-n[D(Q_U \| P_U) + I_{Q_{UY}}(U;Y)]} = e^{-n[\gamma(Q_{UY}, R_2) + R_2]}$,  (67)

where $Q_{UY} = Q_{U|Y} \hat{Q}_Y$. Note that the notation $\gamma(Q_{UY}, R_2)$ is consistent since the function $\gamma(Q, R_2)$ (see (14)) only depends on the marginal distribution $Q_{UY}$.

2) For a given $Q_{UX|Y}$ and conditioned on $\mathcal{C}'_U = c'_U$, the $N_{m_2}(Q_{UX|Y} \hat{Q}_Y)$ corresponding to the $\Lambda(Q_{U|Y} \hat{Q}_Y, c'_U)$ indices $m_2$ for which $(u^n(m_2), y^n) \in \mathcal{T}_{Q_{UY}}$ are i.i.d. binomial random variables, each with $(e^{nR_1} - 1)$ trials and "success" probability

  $\frac{|\mathcal{T}_{Q_{X|UY}}(\tilde{u}^n, y^n)|}{|\mathcal{T}_{Q_{X|U}}(\tilde{u}^n)|} \cdot P_{X^n|U^n}\big(\mathcal{T}_{Q_{X|U}}(\tilde{u}^n) \,\big|\, \tilde{u}^n\big) \doteq e^{-n[D(Q_{X|U} \| P_{X|U} | Q_U) + I_{Q_{UXY}}(X;Y|U)]} = e^{-n[\beta(Q_{UXY}, R_1) + R_1]}$,  (68)

where $Q_{UXY} = Q_{UX|Y} \hat{Q}_Y$ and $(\tilde{u}^n, y^n) \in \mathcal{T}_{Q_{UY}}$; for all remaining indices $m_2$, $N_{m_2}(Q_{UX|Y} \hat{Q}_Y) = 0$.

3) For a given $Q_{X|UY}$, $N_1(Q_{X|UY} \hat{Q}_{UY})$ is a binomial random variable with $(e^{nR_1} - 1)$ trials and "success" probability

  $\frac{|\mathcal{T}_{Q_{X|UY}}(u^n, y^n)|}{|\mathcal{T}_{Q_{X|U}}(u^n)|} \cdot P_{X^n|U^n}\big(\mathcal{T}_{Q_{X|U}}(u^n) \,\big|\, u^n\big) \doteq e^{-n[D(Q_{X|U} \| P_{X|U} | \hat{Q}_U) + I_{Q_{UXY}}(X;Y|U)]} = e^{-n[\beta(Q_{UXY}, R_1) + R_1]}$,  (69)

where $Q_{UXY} = Q_{X|UY} \hat{Q}_{UY}$.

(Footnote: In Fact 1, $P_{U^n}$ and $P_{X^n|U^n}$ (in (67), (68) and (69)) either denote the i.i.d. distributions defined in (1) and (2) or the uniform distributions over the corresponding type classes; see the discussion in Section II-B.)

By using a standard large deviations analysis, we obtain the following proposition, which is useful for analyzing the concentration properties of the random variables defined in (64)–(66).
Proposition 5.
Suppose $V_i$, $i = 1, \ldots, e^{nr}$, where $r > 0$, are i.i.d. Bernoulli random variables with $\mathsf{E}[V_i] = e^{-np}$, where $p > 0$. We have:

1) The probability of the event $\{\sum_{i=1}^{e^{nr}} V_i \ge 1\}$ satisfies

  $\Pr\Big\{ \sum_{i=1}^{e^{nr}} V_i \ge 1 \Big\} \doteq e^{-n|p - r|^+}$.  (70)

2) Let $a = |r - p|^+ + \epsilon \in (0, r)$ where $\epsilon > 0$; then the probability of the event $\{\frac{1}{n} \ln \sum_{i=1}^{e^{nr}} V_i \ge a\}$ decays doubly exponentially, i.e.,

  $\Pr\Big\{ \sum_{i=1}^{e^{nr}} V_i \ge e^{na} \Big\} \le \exp\{-e^{na}[n(p + a - r) - 1]\}$.  (71)

3) Assume $r > p$ and let $a = r - p - \epsilon > 0$ where $\epsilon > 0$; then the probability of the event $\{\frac{1}{n} \ln \sum_{i=1}^{e^{nr}} V_i \le a\}$ decays doubly exponentially, i.e.,

  $\Pr\Big\{ \sum_{i=1}^{e^{nr}} V_i \le e^{na} \Big\} \le \exp\{-e^{na}(e^{n\epsilon} - n\epsilon - 1)\}$.  (72)
Proof of Proposition 5: Part 1 follows from a clipped version of Markov's inequality; see the derivation of [13, Eqn. (41)]. Parts 2) and 3) follow by applying the Chernoff bound; see [25, Appendix B].
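To build intuition for Part 1 of Proposition 5, the following Monte Carlo sketch (our own illustration with arbitrary parameters) compares the empirical probability of $\{\sum_i V_i \ge 1\}$ with the predicted exponential order $e^{-n|p-r|^+}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def prob_at_least_one(n, r, p, trials=5000):
    """Monte Carlo estimate of Pr{sum_{i=1}^{e^{nr}} V_i >= 1} for i.i.d.
    Bernoulli(e^{-np}) variables V_i (Part 1 of Proposition 5)."""
    N = int(np.exp(n * r))                       # number of Bernoulli trials
    hits = sum(rng.binomial(N, np.exp(-n * p)) >= 1 for _ in range(trials))
    return hits / trials

# Empirical probability versus the predicted order e^{-n|p-r|^+}.
for n in (5, 10, 15, 20):
    print(n, prob_at_least_one(n, r=0.3, p=0.5), np.exp(-n * max(0.5 - 0.3, 0)))
```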
Based on Fact 1 and Proposition 5, we can derive the following lemma, which is essential in handling the statistical dependence between $F_1$ and $F_3$. Note that, by definition, these random variables share the same "cloud centers".

Lemma 6. Given $Y^n = y^n$, $\mathcal{C}'_U = c'_U$ and $T \ge 0$, for $n$ sufficiently large, we have

  $\Pr\{F_1 \le e^{-nT} \cdot F_2 \,|\, Y^n = y^n, \mathcal{C}'_U = c'_U\} \le e^{-n R_2 / 2}$.  (73)

Proof:
Let $\mathcal{C}'_X := \{X^n(m_1, m_2) : m_1 \in \mathcal{M}_1 \setminus \{1\}, m_2 \in \mathcal{M}_2\}$, and define

  $f(Q_{UXY}) := -\mathsf{E}_{Q_{UXY}}[\ln W_Y(Y|X)]$.  (74)

Recall the definitions of $\Lambda(Q_{UY}, c'_U)$ and $N_i(Q_{UXY})$ (see (64) and (65)) and let $Q_{UXY} = Q_{U|Y} Q_{X|UY} \hat{Q}_Y$ and $Q'_{UXY} = Q'_{U|Y} Q'_{X|UY} \hat{Q}_Y$. We have

  $\Pr\{F_1 \le e^{-nT} F_2 \,|\, Y^n = y^n, \mathcal{C}'_U = c'_U\} = \sum_{c'_X} \Pr\{\mathcal{C}'_X = c'_X \,|\, Y^n = y^n, \mathcal{C}'_U = c'_U\} \times \mathbb{1}\Big\{ \sum_{Q_{U|Y}} \sum_{Q_{X|UY}} \sum_{i=2}^{e^{nR_2}} N_i(Q_{UXY}) e^{-n f(Q_{UXY})} \le e^{-nT} \sum_{Q'_{U|Y}} \sum_{Q'_{X|UY}} N_1(Q'_{UXY}) e^{-n f(Q'_{UXY})} \Big\}$.  (75)

Define the sets

  $\mathcal{Q}_1(y^n, c'_U) := \{ Q_{U|Y} : \Lambda(Q_{U|Y} \hat{Q}_Y, c'_U) \ge 1 \}$  (76)

and

  $\mathcal{Q}_2(Q_{U|Y}, y^n, c'_U, c'_X) := \{ Q_{X|UY} : \exists\, i \in \mathcal{M}_2 \text{ s.t. } N_i(Q_{X|UY} Q_{U|Y} \hat{Q}_Y) \ge 1 \}$.  (77)

We have the chain of inequalities

  $\mathbb{1}\Big\{ \sum_{Q_{U|Y}} \sum_{Q_{X|UY}} \sum_{i=2}^{e^{nR_2}} N_i(Q_{UXY}) e^{-n f(Q_{UXY})} \le e^{-nT} \sum_{Q'_{U|Y}} \sum_{Q'_{X|UY}} N_1(Q'_{UXY}) e^{-n f(Q'_{UXY})} \Big\}$
  $= \mathbb{1}\Big\{ \sum_{Q_{U|Y} \in \mathcal{Q}_1(y^n, c'_U)} \sum_{Q_{X|UY} \in \mathcal{Q}_2(Q_{U|Y}, y^n, c'_U, c'_X)} \sum_{i=2}^{e^{nR_2}} N_i(Q_{UXY}) e^{-n f(Q_{UXY})} \le e^{-nT} \sum_{Q'_{U|Y} \in \mathcal{Q}_1(y^n, c'_U)} \sum_{Q'_{X|UY} \in \mathcal{Q}_2(Q'_{U|Y}, y^n, c'_U, c'_X)} N_1(Q'_{UXY}) e^{-n f(Q'_{UXY})} \Big\}$  (78)
  $\le \sum_{Q_{U|Y} \in \mathcal{Q}_1(y^n, c'_U)} \sum_{Q_{X|UY} \in \mathcal{Q}_2(Q_{U|Y}, y^n, c'_U, c'_X)} \mathbb{1}\Big\{ \sum_{i=2}^{e^{nR_2}} N_i(Q_{UXY}) e^{-n f(Q_{UXY})} \le e^{-nT} N_1(Q_{UXY}) e^{-n f(Q_{UXY})} \Big\}$  (79)
  $= \sum_{Q_{U|Y} \in \mathcal{Q}_1(y^n, c'_U)} \sum_{Q_{X|UY} \in \mathcal{Q}_2(Q_{U|Y}, y^n, c'_U, c'_X)} \mathbb{1}\Big\{ \sum_{i=2}^{e^{nR_2}} N_i(Q_{UXY}) \le e^{-nT} N_1(Q_{UXY}) \Big\}$  (80)
  $= \sum_{Q_{U|Y}} \sum_{Q_{X|UY}} \mathbb{1}\Big\{ \sum_{i=2}^{e^{nR_2}} N_i(Q_{UXY}) \le e^{-nT} N_1(Q_{UXY}) \Big\} \mathbb{1}\{\Lambda(Q_{U|Y} \hat{Q}_Y, c'_U) \ge 1\} \mathbb{1}\{N_1(Q_{UXY}) \ge 1\}$,  (81)

where (78) is from the fact that $\Lambda(Q_{U|Y} \hat{Q}_Y, c'_U)$ and $N_i(Q_{UXY})$, $i \in \mathcal{M}_2$, are non-negative integers, (79) is due to the fact that $\sum_{k=1}^K a_k \le \sum_{k=1}^K b_k$ implies that there exists a $k \in \{1, \ldots, K\}$ such that $a_k \le b_k$, and the last indicator function $\mathbb{1}\{N_1(Q_{UXY}) \ge 1\}$ in (81) is present because when $N_1(Q_{UXY}) = 0$ we have $\sum_{i=2}^{e^{nR_2}} N_i(Q_{UXY}) \ge 1$, since $Q_{X|UY} \in \mathcal{Q}_2(Q_{U|Y}, y^n, c'_U, c'_X)$ and $Q_{U|Y} \in \mathcal{Q}_1(y^n, c'_U)$. Therefore, combining (75) and (81), we have

  $\Pr\{F_1 \le e^{-nT} F_2 \,|\, Y^n = y^n, \mathcal{C}'_U = c'_U\} \le \sum_{Q_{U|Y}} \sum_{Q_{X|UY}} \mathsf{E}_{\mathcal{C}'_X}\Big[ \mathbb{1}\Big\{ \sum_{i=2}^{e^{nR_2}} N_i(Q_{UXY}) \le e^{-nT} N_1(Q_{UXY}) \Big\} \mathbb{1}\{\Lambda(Q_{U|Y} \hat{Q}_Y, c'_U) \ge 1\} \mathbb{1}\{N_1(Q_{UXY}) \ge 1\} \,\Big|\, Y^n = y^n, \mathcal{C}'_U = c'_U \Big]$.  (82)

Moreover, let $A_n := \sum_{i=2}^{e^{nR_2}} N_i(Q_{UXY})$. From Part 2 of Fact 1, we know that $A_n$ is a binomial random variable with $\Lambda(Q_{U|Y} \hat{Q}_Y, c'_U)(e^{nR_1} - 1)$ trials and "success" probability whose exponent is $(\beta(Q_{UXY}) + R_1)$. There are two cases for the exponent of the expectation of $A_n$: i) $\liminf_{n \to \infty} \frac{1}{n} \ln \mathsf{E}[A_n] > 0$, and ii) $\liminf_{n \to \infty} \frac{1}{n} \ln \mathsf{E}[A_n] \le 0$. For the first case, we know that for sufficiently large $n$,

  $B_n := \frac{1}{n} \ln \Lambda(Q_{U|Y} \hat{Q}_Y, c'_U) - \beta(Q_{UXY}, R_1) > 0$  (83)

uniformly. Then, using Parts 2 and 3 of Proposition 5, for any sufficiently small $\epsilon \in (0, B_n)$, we have

  $\Pr\Big\{ \frac{1}{n} \ln A_n \le B_n - \epsilon \;\cup\; \frac{1}{n} \ln A_n \ge B_n + \epsilon \Big\} \le \exp\{-e^{n(B_n - \epsilon)}\}$.  (84)

In other words, $A_n$ concentrates doubly exponentially fast around its expectation $\mathsf{E}[A_n]$. Therefore, using a similar derivation as in [18, Eqns. (36)–(39)], for any sufficiently small $\epsilon > 0$, we have the chain of inequalities

  $\Pr\Big\{ \sum_{i=2}^{e^{nR_2}} N_i(Q_{UXY}) \le e^{-nT} N_1(Q_{UXY}) \,\Big|\, Y^n = y^n, \mathcal{C}'_U = c'_U \Big\}$
  $\le \sum_{j=0}^{(R_1 + R_2)/\epsilon} \Pr\Big\{ j\epsilon \le \frac{1}{n} \ln A_n \le (j+1)\epsilon \Big\} \Pr\big\{ e^{nj\epsilon} \le e^{-nT} N_1(Q_{UXY}) \,\big|\, Y^n = y^n, \mathcal{C}'_U = c'_U \big\}$  (85)
  $\doteq \Pr\Big\{ \mathsf{E}\Big[ \sum_{i=2}^{e^{nR_2}} N_i(Q_{UXY}) \Big] \le e^{-nT} N_1(Q_{UXY}) \,\Big|\, Y^n = y^n, \mathcal{C}'_U = c'_U \Big\}$  (86)
  $\doteq \Pr\big\{ N_1(Q_{UXY}) \ge e^{nT} e^{nR_2} \mathsf{E}[N_1(Q_{UXY})] \,\big|\, Y^n = y^n, \mathcal{C}'_U = c'_U \big\}$  (87)
  $\le e^{-n(T + R_2)}$,  (88)

where (86) is due to (84) and the fact that $\epsilon$ can be made arbitrarily small, (87) is due to the fact that the $N_i(Q_{UXY})$, $i \in \mathcal{M}_2$, are i.i.d. (see Part 2 of Fact 1), and (88) is due to Markov's inequality.

For the second case, in which $\mathsf{E}[A_n]$ is not exponentially large, we also have that

  $\mathsf{E}\Big[ \sum_{i=2}^{e^{nR_2}} N_i(Q_{UXY}) \Big] = (e^{nR_2} - 1) \mathsf{E}[N_1(Q_{UXY})]$.  (89)

Thus, we have

  $\liminf_{n \to \infty} -\frac{1}{n} \ln \mathsf{E}[N_1(Q_{UXY})] \ge R_2$.  (90)

Furthermore, for sufficiently large $n$, by using (90) and Markov's inequality, we have

  $\Pr\{N_1(Q_{UXY}) \ge 1 \,|\, Y^n = y^n, \mathcal{C}'_U = c'_U\} \le e^{-n R_2 / 2}$.  (91)

Therefore, combining (82), (88) and (91), for sufficiently large $n$, we have

  $\Pr\{F_1 \le e^{-nT} F_2 \,|\, Y^n = y^n, \mathcal{C}'_U = c'_U\} \le e^{-n R_2 / 2}$.  (92)

This concludes the proof of Lemma 6.

Now, we use Lemma 6 to prove the following lemma, which eliminates $F_3$ from the probability of interest, removes the dependence between $F_1$ and $F_3$, and also simplifies the calculation of $\Pr\{\mathcal{E}_1\}$ (see (63)).

Lemma 7.
For given $(U^n(1), X^n(1,1), Y^n) = (u^n, x^n, y^n)$, $\mathcal{C}'_U = c'_U$ and $T \ge 0$, we have

  $\Pr\{F_1 + F_2 > \max\{f, F_3\} \cdot e^{-nT}\} \doteq \max\big\{ \Pr\{F_1 > f \cdot e^{-nT}\}, \; \Pr\{F_2 > f \cdot e^{-nT}\} \big\}$.  (93)

Proof of Lemma 7:
See Appendix C.Now, we continue the proof of Theorem 1 by usingLemma 7. Recall the error probability in (63). Note that Pr {E } . = Pr (cid:8) F + F > max { f , F } · e − nT (cid:9) . (94)By using Lemma 7, we have Pr {E } . = max (cid:8) Pr { F > f · e − nT } , Pr { F > f · e − nT } (cid:9) . (95) X Q U | Y X Q X | UY e nR X i =2 N i ( Q UXY ) e − nf ( Q UXY ) ≤ e − nT X Q ′ U | Y X Q ′ X | UY N ( Q ′ UXY ) e − nf ( Q ′ UXY ) = ( X Q U | Y ∈Q ( y n ,c ′ U ) X Q X | UY ∈Q ( Q U | Y ,y n ,c ′ U ,c ′ X ) e nR X i =2 N i ( Q UXY ) e − nf ( Q UXY ) ≤ e − nT X Q ′ U | Y ∈Q ( y n ,c ′ U ) X Q ′ X | UY ∈Q ( Q ′ U | Y ,y n ,c ′ U ,c ′ X ) N ( Q ′ UXY ) e − nf ( Q ′ UXY ) ) (78) ≤ X Q U | Y ∈Q ( y n ,c ′ U ) X Q X | UY ∈Q ( Q U | Y ,y n ,c ′ U ,c ′ X ) ( e nR X i =2 N i ( Q UXY ) e − nf ( Q UXY ) ≤ e − nT N ( Q UXY ) e − nf ( Q UXY ) ) (79) = X Q U | Y ∈Q ( y n ,c ′ U ) X Q X | UY ∈Q ( Q U | Y ,y n ,c ′ U ,c ′ X ) e nR X i =2 N i ( Q UXY ) ≤ e − nT N ( Q UXY ) (80) = X Q U | Y X Q X | UY e nR X i =2 N i ( Q UXY ) ≤ e − nT N ( Q UXY ) { Λ( Q U | Y ˆ Q Y , c ′ U ) ≥ } { N ( Q UXY ) ≥ } (81) Pr e nR X i =2 N i ( Q UXY ) ≤ e − nT N ( Q UXY ) (cid:12)(cid:12)(cid:12)(cid:12) Y n = y n , C ′ U = c ′ U ≤ ( R + R ) /ǫ X j =0 Pr (cid:26) jǫ ≤ n ln A n ≤ ( j + 1) ǫ (cid:27) Pr (cid:8) e njǫ ≤ e − nT N ( Q UXY ) (cid:12)(cid:12) Y n = y n , C ′ U = c ′ U (cid:9) (85) . = Pr E e nR X i =2 N i ( Q UXY ) ≤ e − nT N ( Q UXY ) (cid:12)(cid:12)(cid:12)(cid:12) Y n = y n , C ′ U = c ′ U (86) . = Pr n N ( Q UXY ) ≥ e nT e nR E [ N ( Q UXY )] (cid:12)(cid:12)(cid:12) Y n = y n , C ′ U = c ′ U o (87) ≤ e − n ( T + R ) , (88)Next, we consider the first term in the right-hand-side of (95).Recall the definitions of f ( Q UXY ) and Q ( y n , c ′ U ) in theproof of Lemma 6 (see (74) and (76)) and let s := − n ln( f · e − nT ) = f ( ˆ Q UXY ) + T. (96)Now, let Q UXY = Q X | UY Q U | Y ˆ Q Y , we have the chain ofexponential equalities (97)–(102) on the top of the next page,where the interchange of max {·} and Pr {·} in (99) is justifiedsimilarly as [13, Eqn. (37)] and [18, Eqns. (15)–(20)].Using Part 2 of Fact 1, for a given C ′ U = c ′ U , we evaluatethe inner probability in (102) as follows.1) The case f ( Q UXY ) − s ≤ . Note that N i ( Q UXY ) , i ∈ M \ { } , are non-negative integers. Using Part 1of Proposition 5, we have Pr (cid:26) e nR X i =2 N i ( Q UXY ) ≥ e n [ f ( Q UXY ) − s ] (cid:12)(cid:12)(cid:12)(cid:12) C ′ U = c ′ U (cid:27) = Pr (cid:26) e nR X i =2 N i ( Q UXY ) ≥ (cid:12)(cid:12)(cid:12)(cid:12) C ′ U = c ′ U (cid:27) (103) . = exp (cid:26) − n (cid:12)(cid:12)(cid:12) β ( Q UXY ) − n ln Λ( Q U | Y ˆ Q Y , c ′ U ) (cid:12)(cid:12)(cid:12) + (cid:27) . (104)2) The case f ( Q UXY ) − s > (cid:12)(cid:12) n ln Λ( Q U | Y ˆ Q Y , c ′ U ) − β ( Q UXY ) (cid:12)(cid:12) + . Using Part 2 of Proposition 5, for suffi-ciently large n , we have Pr (cid:26) e nR X i =2 N i ( Q UXY ) ≥ e n [ f ( Q UXY ) − s ] (cid:12)(cid:12)(cid:12)(cid:12) C ′ U = c ′ U (cid:27) ≤ exp (cid:8) − e n [ f ( Q UXY ) − s ] (cid:9) . (105)This term decays at least doubly exponentially rapidlyand hence its exponent is infinity.3) The case < f ( Q UXY ) − s < (cid:2) n ln Λ( Q U | Y ˆ Q Y , c ′ U ) − β ( Q UXY ) (cid:3) . 
Using Part 3 of Proposition 5, for sufficientlylarge n , we have Pr (cid:26) e nR X i =2 N i ( Q UXY ) ≥ e n [ f ( Q UXY ) − s ] (cid:12)(cid:12)(cid:12)(cid:12) C ′ U = c ′ U (cid:27) E C ′ U (cid:2) Pr { F > f · e − nT (cid:12)(cid:12) C ′ U } (cid:3) = E C ′ U Pr (cid:26) X Q U | Y ∈Q ( y n , C ′ U ) X Q X | UY e nR X i =2 N i ( Q UXY ) e − nf ( Q UXY ) ≥ e − ns (cid:12)(cid:12)(cid:12)(cid:12) C ′ U (cid:27) (97) . = E C ′ U Pr (cid:26) max Q U | Y ∈Q ( y n , C ′ U ) max Q X | UY e nR X i =2 N i ( Q UXY ) e − nf ( Q UXY ) ≥ e − ns (cid:12)(cid:12)(cid:12)(cid:12) C ′ U (cid:27) (98) . = E C ′ U max Q U | Y ∈Q ( y n , C ′ U ) max Q X | UY Pr (cid:26) e nR X i =2 N i ( Q UXY ) e − nf ( Q UXY ) ≥ e − ns (cid:12)(cid:12)(cid:12)(cid:12) C ′ U (cid:27) (99) . = E C ′ U X Q U | Y ∈Q ( y n , C ′ U ) max Q X | UY Pr (cid:26) e nR X i =2 N i ( Q UXY ) e − nf ( Q UXY ) ≥ e − ns (cid:12)(cid:12)(cid:12)(cid:12) C ′ U (cid:27) (100) = E C ′ U X Q U | Y { Λ( Q U | Y ˆ Q Y , C ′ U ) ≥ } max Q X | UY Pr (cid:26) e nR X i =2 N i ( Q UXY ) e − nf ( Q UXY ) ≥ e − ns (cid:12)(cid:12)(cid:12)(cid:12) C ′ U (cid:27) (101) = X Q U | Y E C ′ U { Λ( Q U | Y ˆ Q Y , C ′ U ) ≥ } max Q X | UY Pr (cid:26) e nR X i =2 N i ( Q UXY ) ≥ e n ( f ( Q UXY ) − s ) (cid:12)(cid:12)(cid:12)(cid:12) C ′ U (cid:27) (102) = 1 − Pr (cid:26) e nR X i =2 N i ( Q UXY ) < e n [ f ( Q UXY ) − s ] (cid:12)(cid:12)(cid:12)(cid:12) C ′ U = c ′ U (cid:27) (106) ≥ − exp (cid:8) − e n [ f ( Q UXY ) − s ] (cid:9) . (107)This term converges to 1 at least doubly exponentiallyfast and hence its exponent is 0.In summary, we have (108)–(109) on the top of the next page,where E ( Q UXY , t, r ):= | β ( Q UXY , R ) + R − r | + if Q UXY ∈ L ( t, r, R ) ∞ else (110)and where L ( t, r, R ):= { Q UXY : t + f ( Q UXY ) ≤ | r − β ( Q UXY , R ) − R | + } . (111)Note that the first clause in (108) comes from cases 1) and 3)above. The second clause in (108) comes from case 2) above.For brevity, define E ∗ ( Q UY , t, r ) := min Q X | UY E ( Q X | UY Q UY , t, r ) . (112)Therefore, recall that Q UXY = Q X | UY Q U | Y ˆ Q Y , by combin-ing (102), (109) and (112), we have the exponential equali-ties (113)–(115) on the top of the next page.Now we regard C ′ U as a collection of random variables.Consequently, Λ( Q U | Y ˆ Q Y , C ′ U ) is also a random variable. Using a similar derivation as in [18, Eqns. (36)–(39)], for anysufficiently small ǫ > , we have E C ′ U (cid:2) Pr { F > f · e − nT (cid:12)(cid:12) C ′ U } (cid:3) . = max Q U | Y X ≤ λ ≤ e nR Pr { Λ( Q U | Y ˆ Q Y , C ′ U ) = λ }× exp n − nE ∗ (cid:16) Q U | Y ˆ Q Y , − s, R + 1 n ln λ (cid:17)o (116) ≥ max Q U | Y X ≤ i ≤ R /ǫ Pr (cid:26) iǫ ≤ n ln Λ( Q U | Y ˆ Q Y , C ′ U ) ≤ ( i + 1) ǫ (cid:27) × exp (cid:8) − nE ∗ ( Q U | Y ˆ Q Y , − s, R + iǫ ) (cid:9) (117)Recalling Part 1 of Fact 1, we can evaluate the probability Pr { iǫ ≤ n ln Λ( Q U | Y ˆ Q Y , C ′ U ) ≤ ( i + 1) ǫ } as follows.1) Case γ ( Q U | Y ˆ Q Y , R ) < : Similarly as before, we seethat Λ( Q U | Y ˆ Q Y , C ′ U ) concentrates doubly exponentiallyaround its expectation which is, on the exponential scale, e − nγ ( Q U | Y ˆ Q Y ,R ) . In other words, we have Pr (cid:26) iǫ ≤ n ln Λ( Q U | Y ˆ Q Y , C ′ U ) ≤ ( i + 1) ǫ (cid:27) . = { iǫ ≤ − γ ( Q U | Y ˆ Q Y , R ) ≤ ( i + 1) ǫ } (118)2) Case γ ( Q U | Y ˆ Q Y , R ) ≥ : Similarly as before, we seethat, on the one hand, Pr (cid:26) n ln Λ( Q U | Y ˆ Q Y , C ′ U ) ≥ (cid:27) . 
\[
\Pr\Big\{\tfrac{1}{n}\ln\Lambda(Q_{U|Y}\hat{Q}_Y, \mathcal{C}'_U) \ge 0\Big\} \doteq \exp\big\{-n\gamma(Q_{U|Y}\hat{Q}_Y, R_2)\big\}; \tag{119}
\]
on the other hand,
\[
\Pr\Big\{\tfrac{1}{n}\ln\Lambda(Q_{U|Y}\hat{Q}_Y, \mathcal{C}'_U) \ge \epsilon\Big\} \mathrel{\dot{\le}} \exp\{-e^{n\epsilon}\}. \tag{120}
\]
Therefore, we have
\[
\Pr\Big\{i\epsilon \le \tfrac{1}{n}\ln\Lambda(Q_{U|Y}\hat{Q}_Y, \mathcal{C}'_U) \le (i+1)\epsilon\Big\} \doteq \mathbb{1}\{i = 0\}\exp\big\{-n\gamma(Q_{U|Y}\hat{Q}_Y, R_2)\big\}. \tag{121}
\]
In summary, we have
\[
\Pr\Big\{i\epsilon \le \tfrac{1}{n}\ln\Lambda(Q_{U|Y}\hat{Q}_Y, \mathcal{C}'_U) \le (i+1)\epsilon\Big\} \doteq \mathbb{1}\big\{i\epsilon \le |{-\gamma(Q_{U|Y}\hat{Q}_Y, R_2)}|^+ \le (i+1)\epsilon\big\}\exp\big\{-n|\gamma(Q_{U|Y}\hat{Q}_Y, R_2)|^+\big\}. \tag{122}
\]
Therefore, by recalling the definitions of $\Phi(Q_{UXY}, R_1, R_2)$ and $\mathcal{L}(\hat{Q}_{XY}, R_1, R_2, T)$ (see (15) and (21)), and combining (117) and (122), we have the exponential equalities
\begin{align}
\mathbb{E}_{\mathcal{C}'_U}\big[\Pr\{F_2 > f_0\cdot e^{-nT} \mid \mathcal{C}'_U\}\big]
&\doteq \max_{Q_{U|Y}}\Big[\exp\big\{-nE^*\big(Q_{U|Y}\hat{Q}_Y, -s, R_2 + |{-\gamma(Q_{U|Y}\hat{Q}_Y, R_2)}|^+\big)\big\}\cdot\exp\big\{-n|\gamma(Q_{U|Y}\hat{Q}_Y, R_2)|^+\big\}\Big] \tag{123}\\
&= \exp\Big\{-n\min_{Q_{U|Y}}\Big[E^*\big(Q_{U|Y}\hat{Q}_Y, -s, R_2 + |{-\gamma(Q_{U|Y}\hat{Q}_Y, R_2)}|^+\big) + |\gamma(Q_{U|Y}\hat{Q}_Y, R_2)|^+\Big]\Big\} \tag{124}\\
&= \exp\Big\{-n\min_{Q_{UX|Y}}\Big[E\big(Q_{UX|Y}\hat{Q}_Y, -s, R_2 + |{-\gamma(Q_{U|Y}\hat{Q}_Y, R_2)}|^+\big) + |\gamma(Q_{U|Y}\hat{Q}_Y, R_2)|^+\Big]\Big\} \tag{125}\\
&= \exp\Big\{-n\min_{Q_{UX|Y}\in\mathcal{L}(\hat{Q}_{XY}, R_1, R_2, T)} \Phi(Q_{UX|Y}\hat{Q}_Y, R_1, R_2)\Big\}. \tag{126}
\end{align}
Here, (123) is due to (122) and the fact that $\epsilon$ can be made arbitrarily small; we use the notation $\doteq$ (i.e., equality to first order in the exponent) in (123) since the other direction of the inequality in (117) can be derived by replacing $i\epsilon$ with $(i+1)\epsilon$ in the function $E^*(\cdot)$ (see [18, Eqns. (36)--(39)] for another instance of this calculation). Moreover, (126) is due to the fact that
\[
\big|\beta(Q_{UXY}, R_1) - |{-\gamma(Q_{UXY}, R_2)}|^+\big|^+ + |\gamma(Q_{UXY}, R_2)|^+ = \begin{cases} \beta(Q_{UXY}, R_1) + \gamma(Q_{UXY}, R_2) & \text{if } \gamma(Q_{UXY}, R_2) \ge 0,\ \beta(Q_{UXY}, R_1) \ge 0 \\ \gamma(Q_{UXY}, R_2) & \text{if } \gamma(Q_{UXY}, R_2) \ge 0,\ \beta(Q_{UXY}, R_1) \le 0 \\ \beta(Q_{UXY}, R_1) + \gamma(Q_{UXY}, R_2) & \text{if } \gamma(Q_{UXY}, R_2) \le 0 \text{ and } \beta(Q_{UXY}, R_1) + \gamma(Q_{UXY}, R_2) \ge 0 \\ 0 & \text{if } \gamma(Q_{UXY}, R_2) \le 0 \text{ and } \beta(Q_{UXY}, R_1) + \gamma(Q_{UXY}, R_2) \le 0 \end{cases} \tag{127}
\]
\[
= \Phi(Q_{UXY}, R_1, R_2). \tag{128}
\]
After averaging over $(U^n(1), X^n(1,1), Y^n)$, we have
\[
\lim_{n\to\infty} -\frac{1}{n}\ln \mathbb{E}_{(U^n(1), X^n(1,1), Y^n)}\Big[\mathbb{E}_{\mathcal{C}'_U}\big[\Pr\{F_2 > f_0\cdot e^{-nT} \mid \mathcal{C}'_U\}\big]\Big] = \min_{\hat{Q}_{UXY}}\Big[D(\hat{Q}_{UXY}\,\|\,P_{UXY}) + \min_{Q_{UX|Y}\in\mathcal{L}(\hat{Q}_{XY}, R_1, R_2, T)} \Phi(Q_{UX|Y}\hat{Q}_Y, R_1, R_2)\Big]. \tag{129}
\]
For the remaining term $\mathbb{E}_{(U^n(1), X^n(1,1), Y^n)}[\mathbb{E}_{\mathcal{C}'_U}[\Pr\{F_1 > f_0\cdot e^{-nT} \mid \mathcal{C}'_U\}]]$ (the second term in (95)), the proof is similar to that of (129), so we provide only an outline.
Recalling the definition of $N(Q_{UXY})$ in (66) and Part 3 of Fact 1, we have
\begin{align}
\mathbb{E}_{\mathcal{C}'_U}\big[\Pr\{F_1 > f_0\cdot e^{-nT} \mid \mathcal{C}'_U\}\big]
&= \Pr\bigg\{\sum_{Q_{X|UY}} N(Q_{X|UY}\hat{Q}_{UY})\,e^{-nf(Q_{X|UY}\hat{Q}_{UY})} \ge e^{-ns}\bigg\} \tag{130}\\
&\doteq \exp\Big\{-n\min_{Q_{X|UY}} E(Q_{X|UY}\hat{Q}_{UY}, -s, R_2)\Big\}. \tag{131}
\end{align}
After averaging over $(U^n(1), X^n(1,1), Y^n)$, we have
\[
\lim_{n\to\infty} -\frac{1}{n}\ln \mathbb{E}_{(U^n(1), X^n(1,1), Y^n)}\Big[\mathbb{E}_{\mathcal{C}'_U}\big[\Pr\{F_1 > f_0\cdot e^{-nT} \mid \mathcal{C}'_U\}\big]\Big] = \Psi_b. \tag{132}
\]
Then, due to (95), we have $E^{\mathrm{t}}_1 = \min\{\Psi_a, \Psi_b\}$.

For the total error probability of the message pair, according to the optimal decoding region (10), we obtain
\[
\mathbb{E}_{\mathcal{C}}[e^{\mathrm{t}}_Y(1,1)] = \mathbb{E}_{\mathcal{C}}\big[\Pr\{F' + F_1 > f_0\cdot e^{-nT} \mid \mathcal{C}\}\big], \tag{133}
\]
where
\begin{align}
F' &:= \sum_{m'_1\in\mathcal{M}_1}\ \sum_{m'_2\in\mathcal{M}_2\setminus\{1\}} W^n_Y(Y^n \mid X^n(m'_1, m'_2)) \tag{134}\\
&= F_2 + F_3. \tag{135}
\end{align}
As the difference between $F_2$ and $F'$ is only in the number of values of $m'_1$ (the difference is exactly one, and the rates are asymptotically equal), the exponents of $\mathbb{E}_{\mathcal{C}'_U}[\Pr\{F_2 > f_0\cdot e^{-nT} \mid \mathcal{C}'_U\}]$ and $\mathbb{E}_{\mathcal{C}'_U}[\Pr\{F' > f_0\cdot e^{-nT} \mid \mathcal{C}'_U\}]$ are identical. Therefore, we have $E^{\mathrm{t}}_Y = E^{\mathrm{t}}_1$.

Now we explain why $E^{\mathrm{u}}_1 = E^{\mathrm{t}}_1 + T$ and $E^{\mathrm{u}}_Y = E^{\mathrm{t}}_Y + T$. In [14, Lemma 1], it was shown that, for discrete memoryless channels $W : \mathcal{X} \to \mathcal{Y}$, the undetected error exponent $E^{\mathrm{u}}$ is equal to the sum of the total error exponent $E^{\mathrm{t}}$ and the threshold $T$. The main argument is based on the fact that the optimal decoding region
\[
\mathcal{D}^*_m := \bigg\{y^n : \frac{W^n(y^n \mid x^n(m))}{\sum_{m'\ne m} W^n(y^n \mid x^n(m'))} \ge e^{nT}\bigg\} \tag{136}
\]
minimizes the function
\[
\Gamma(\mathcal{C}, \mathcal{D}) = e^{\mathrm{u}} + e^{-nT} e^{\mathrm{t}} \tag{137}
\]
for a given codebook $\mathcal{C}$ and a given threshold $T$, where $e^{\mathrm{t}}$ and $e^{\mathrm{u}}$ are the average total and undetected error probabilities, respectively. Moreover, the proof of [14, Lemma 1] depends neither on the structure of the codebook nor on the closed-form expressions of the exponents $E^{\mathrm{t}}$ and $E^{\mathrm{u}}$. Therefore, we can use the same idea to show that $E^{\mathrm{u}}_1 = E^{\mathrm{t}}_1 + T$ and $E^{\mathrm{u}}_Y = E^{\mathrm{t}}_Y + T$, since the optimal decoding regions $\mathcal{D}^*_{m_1}$ and $\mathcal{D}^*_{m_1 m_2}$ defined in (9) and (10) also minimize $\Gamma(\mathcal{C}, \mathcal{D})$ for the ABC.

For constant composition random codes, since $(U^n(m_2), X^n(m_1, m_2)) \in \mathcal{T}_{P_{UX}}$ for all $(m_1, m_2) \in \mathcal{M}_1 \times \mathcal{M}_2$, all joint types $Q_{UXY}$ (resp. $Q_{UY}$) must satisfy the condition that their marginal distributions $Q_{UX}$ (resp. $Q_U$) equal $P_{UX}$ (resp. $P_U$). Therefore, the results can be proved similarly to the i.i.d. case, with this additional constraint imposed on all types.

This concludes the proof of Theorem 1.
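To make the role of $\Gamma(\mathcal{C}, \mathcal{D})$ concrete, the following sketch checks the optimality of the region (136) by brute force on a deliberately tiny instance. All concrete choices (the BSC crossover probability, the two repetition codewords, and the threshold $T$) are illustrative assumptions and do not come from the paper; the point is only that Forney's rule attains the minimum of $\Gamma$ over all possible decoder assignments.

\begin{verbatim}
import itertools
import math

p = 0.1                       # BSC crossover probability (assumed)
n = 3
T = 0.2                       # erasure threshold (assumed)
codebook = [(0, 0, 0), (1, 1, 1)]   # two repetition codewords (assumed)
M = len(codebook)
outputs = list(itertools.product((0, 1), repeat=n))

def W(y, x):                  # memoryless channel law W^n(y | x)
    return math.prod(p if yi != xi else 1 - p for yi, xi in zip(y, x))

def gamma_value(decoder):     # Gamma(C, D) = e^u + e^{-nT} e^t, cf. (137)
    eu = et = 0.0
    for m, x in enumerate(codebook):
        for y in outputs:
            w = W(y, x) / M
            if decoder[y] != m:
                et += w                 # total error: wrong message or erasure
                if decoder[y] != 'e':
                    eu += w             # undetected error: wrong message
    return eu + math.exp(-n * T) * et

# Forney's rule (136): decode m iff W(y|x_m) >= e^{nT} sum_{m' != m} W(y|x_{m'})
forney = {}
for y in outputs:
    likes = [W(y, x) for x in codebook]
    winners = [m for m in range(M)
               if likes[m] >= math.exp(n * T) * (sum(likes) - likes[m])]
    forney[y] = winners[0] if winners else 'e'

# Brute force over all 3^8 decoder assignments y -> {0, 1, erase}
best = min(gamma_value(dict(zip(outputs, assign)))
           for assign in itertools.product(list(range(M)) + ['e'],
                                           repeat=len(outputs)))
print(gamma_value(forney), best)        # the two values coincide
\end{verbatim}

The per-output cost comparison shows that decoding $m$ beats erasing exactly when $W^n(y^n|x^n(m)) \ge e^{nT}\sum_{m'\ne m} W^n(y^n|x^n(m'))$, which is why the brute-force minimum coincides with $\Gamma$ evaluated at Forney's rule.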
VII. PROOF OF THEOREM 2

Proof of Theorem 2: Firstly, we consider i.i.d. random codes. Assume the true transmitted message pair is $(m_1, m_2) = (1, 1)$. Define the (total) error event $\mathcal{E}$ as
\[
\mathcal{E} := \bigg\{\sum_{m'_2 \ne 1} \Pr(Y^n \mid \mathcal{C}(m'_2)) > \Pr(Y^n \mid \mathcal{C}(1))\,e^{-nT}\bigg\}. \tag{138}
\]
The average total error probability for message $m_2 = 1$ associated to the decoding region $\mathcal{D}^*_{m_2}$ in (9) is given by
\[
\mathbb{E}_{\mathcal{C}}[e^{\mathrm{t}}_2(1,1)] = \mathbb{E}_{(U^n(1), X^n(1,1), Y^n)}\big[\Pr\{\mathcal{E} \mid (U^n(1), X^n(1,1), Y^n)\}\big]. \tag{139}
\]
Recall the definitions of $F'$ and $F_1$ (see (135) and (60)). Similarly, for a given $(U^n(1), X^n(1,1), Y^n) = (u^n, x^n, y^n)$ with joint type $\hat{Q}_{UXY}$, we have
\[
\Pr\{\mathcal{E}\} = \Pr\big\{F' > (f_0 + F_1)\cdot e^{-nT}\big\}. \tag{140}
\]
For a given sub-codebook $\mathcal{C}'(1) := \{X^n(m_1, 1) : m_1 \in \mathcal{M}_1 \setminus \{1\}\}$, let
\[
k := \frac{1}{n}\ln(f_0 + F_1) = \frac{1}{n}\ln\sum_{m_1\in\mathcal{M}_1} W^n_Y(y^n \mid x^n(m_1, 1)), \tag{141}
\]
so that the right-hand side of the inequality inside the probability in (140) is constant. Similarly to the calculation of $\mathbb{E}_{\mathcal{C}'_U}[\Pr\{F_2 > f_0\cdot e^{-nT} \mid \mathcal{C}'_U\}]$ in (126), we obtain
\[
\Pr\big\{F' > e^{n(k-T)}\big\} \doteq \exp\big\{-nE(\hat{Q}_Y, k - T, R_1, R_2)\big\}, \tag{142}
\]
where
\begin{align}
E(\hat{Q}_Y, t, R_1, R_2) &:= \min_{Q_{UX|Y}}\Big[E\big(Q_{UX|Y}\hat{Q}_Y, t, R_2 + |{-\gamma(Q_{UX|Y}\hat{Q}_Y, R_2)}|^+\big) + |\gamma(Q_{UX|Y}\hat{Q}_Y, R_2)|^+\Big] \tag{143--144}\\
&= \min_{Q_{UX|Y}\in\mathcal{L}(\hat{Q}_Y, t, R_1, R_2)} \Psi(Q_{UX|Y}\hat{Q}_Y, R_1, R_2), \tag{145}
\end{align}
and where
\[
\mathcal{L}(\hat{Q}_Y, t, R_1, R_2) := \Big\{Q_{UX|Y} : \mathbb{E}_{Q_{UX|Y}\hat{Q}_Y}\Big[\ln\frac{1}{W_Y}\Big] + t \le \Delta(Q_{UX|Y}\hat{Q}_Y, R_1, R_2)\Big\}. \tag{146}
\]
Next, we consider the scenario in which the sub-codebook $\mathcal{C}'(1)$ is random. Consequently,
\[
K = \frac{1}{n}\ln(f_0 + F_1) \tag{147}
\]
is also random. Using a similar derivation as in [18, Eqns. (36)--(39)], for any sufficiently small $\epsilon > 0$, we have
\begin{align}
\Pr\big\{F' > (f_0 + F_1)\cdot e^{-nT}\big\} &\doteq \sum_k \Pr\{K = k\}\cdot\exp\big\{-nE(\hat{Q}_Y, k - T, R_1, R_2)\big\} \tag{148}\\
&\mathrel{\dot{\le}} \sum_i \Pr\{i\epsilon \le K < (i+1)\epsilon\}\exp\big\{-nE(\hat{Q}_Y, i\epsilon - T, R_1, R_2)\big\}, \tag{149}
\end{align}
where $i$ in the last inequality ranges from $-f(\hat{Q}_{UXY})/\epsilon$ to $R_1/\epsilon$.

Recall the definition of $N(Q_{UXY})$ (see (66)) and let $Q_{UXY} = Q_{X|UY}\hat{Q}_{UY}$; we have
\[
e^{nk} = e^{-nf(\hat{Q}_{UXY})} + \sum_{Q_{X|UY}} N(Q_{UXY})\,e^{-nf(Q_{UXY})}. \tag{150}
\]
Note that the first term on the right side of (150) is fixed. For the second term, we now evaluate the probability
\[
\Pr\Big\{e^{nt} \le \sum_{Q_{X|UY}} N(Q_{UXY})\,e^{-nf(Q_{UXY})} \le e^{n(t+\epsilon)}\Big\}. \tag{151}
\]
On the one hand, we have (similarly as before)
\[
\Pr\Big\{\sum_{Q_{X|UY}} N(Q_{UXY})\,e^{-nf(Q_{UXY})} \ge e^{nt}\Big\} \doteq \exp\big\{-nE^*(\hat{Q}_{UY}, t, R_2)\big\}. \tag{152}
\]
On the other hand, by using a similar derivation as in [18, Eqns. (30)--(34)] and [6, p. 5081], we can derive the exponent of the probability that $\sum_{Q_{X|UY}} N(Q_{UXY})\,e^{-nf(Q_{UXY})}$ is upper bounded by $e^{n(t+\epsilon)}$ in the following steps. Firstly, we have
\begin{align}
\Pr\Big\{\sum_{Q_{X|UY}} N(Q_{UXY})\,e^{-nf(Q_{UXY})} \le e^{n(t+\epsilon)}\Big\}
&\doteq \Pr\Big\{\max_{Q_{X|UY}} N(Q_{UXY})\,e^{-nf(Q_{UXY})} \le e^{n(t+\epsilon)}\Big\} \tag{153}\\
&= \Pr\bigg\{\bigcap_{Q_{X|UY}}\Big\{N(Q_{UXY}) \le \exp\{n[t + \epsilon + f(Q_{UXY})]\}\Big\}\bigg\}. \tag{154}
\end{align}
Recalling Part 3 of Fact 1, there are two cases for the probability of the events $\{N(Q_{UXY}) \le \exp\{n[t + \epsilon + f(Q_{UXY})]\}\}$:

1) Case $\beta(Q_{UXY}, R_1) < 0$ and $t + \epsilon + f(Q_{UXY}) < -\beta(Q_{UXY}, R_1)$: From Part 3 of Proposition 5, we see that
\[
\Pr\big\{N(Q_{UXY}) \le \exp\{n[t + \epsilon + f(Q_{UXY})]\}\big\} \doteq 0. \tag{155}
\]
2) Case $\beta(Q_{UXY}, R_1) > 0$ or $t + \epsilon + f(Q_{UXY}) \ge -\beta(Q_{UXY}, R_1)$: Similarly as before, for sufficiently large $n$, we have
\begin{align}
\Pr\big\{N(Q_{UXY}) \le \exp\{n[t + \epsilon + f(Q_{UXY})]\}\big\} &= 1 - \Pr\big\{N(Q_{UXY}) > \exp\{n[t + \epsilon + f(Q_{UXY})]\}\big\} \tag{156}\\
&\ge 1 - \exp\{-n|\beta(Q_{UXY}, R_1)|\} \to 1. \tag{157}
\end{align}
Therefore, the probability in (154) is, on the exponential scale, equal to the indicator function which returns 1 if, for every $Q_{X|UY}$, either $\beta(Q_{UXY}, R_1) > 0$ or $t + \epsilon + f(Q_{UXY}) \ge -\beta(Q_{UXY}, R_1)$; equivalently,
\[
\Pr\bigg\{\bigcap_{Q_{X|UY}}\Big\{N(Q_{UXY}) \le e^{n[t+\epsilon+f(Q_{UXY})]}\Big\}\bigg\} \doteq \mathbb{1}\Big\{\min_{Q_{X|UY}}\big\{\beta(Q_{UXY}, R_1) + |t + \epsilon + f(Q_{UXY})|^+\big\} \ge 0\Big\}. \tag{158}
\]
We now find the minimum value of $t + \epsilon$ for which the value of this indicator function is unity.
The condition in the indicator function above is equivalent to
\[
\min_{Q_{X|UY}}\max_{0 \le a \le 1}\big\{\beta(Q_{UXY}, R_1) + a[t + \epsilon + f(Q_{UXY})]\big\} \ge 0 \tag{159}
\]
or, equivalently:
\[
\forall\, Q_{X|UY}\ \exists\, 0 \le a \le 1 : \quad \beta(Q_{UXY}, R_1) + a[t + \epsilon + f(Q_{UXY})] \ge 0, \tag{160}
\]
which can also be written as
\[
\forall\, Q_{X|UY}\ \exists\, 0 \le a \le 1 : \quad t + \epsilon \ge -f(Q_{UXY}) - \frac{\beta(Q_{UXY}, R_1)}{a}. \tag{161}
\]
This is equivalent to
\begin{align}
t + \epsilon &\ge \max_{Q_{X|UY}}\min_{0 \le a \le 1}\bigg[-f(Q_{UXY}) - \frac{\beta(Q_{UXY}, R_1)}{a}\bigg] \tag{162}\\
&= \max_{Q_{X|UY}}\bigg[-f(Q_{UXY}) - \begin{cases} \beta(Q_{UXY}, R_1) & \text{if } \beta(Q_{UXY}, R_1) \le 0 \\ \infty & \text{if } \beta(Q_{UXY}, R_1) > 0 \end{cases}\bigg] \tag{163}\\
&= -\min_{Q_{X|UY}:\,\beta(Q_{UXY}, R_1)\le 0}\big[f(Q_{UXY}) + \beta(Q_{UXY}, R_1)\big] \tag{164}\\
&= s(\hat{Q}_{UY}, R_1), \tag{165}
\end{align}
where the minimum in (164) over an empty set is defined as infinity.

Furthermore, we need the following lemma, which provides some useful properties of $s(\hat{Q}_{UY}, R_1)$ defined in (29) (also see (165)) and $E^*(\hat{Q}_{UY}, t, R_2)$ defined in (112). Using this lemma, we can obtain the exponent of the probability in (151).

Lemma 8.
1) $s(\hat{Q}_{UY}, R_1) > -\infty$, i.e., the set $\{Q_{X|UY} : \beta(Q_{X|UY}\hat{Q}_{UY}, R_1) \le 0\}$ is not empty.
2) $E^*(\hat{Q}_{UY}, t, R_2)$ vanishes for all $t \le s(\hat{Q}_{UY}, R_1)$.
3) $E^*(\hat{Q}_{UY}, t, R_2)$ is strictly positive for all $t > s(\hat{Q}_{UY}, R_1)$.

Proof of Lemma 8: See Appendix D.

In summary, we have
\[
\Pr\Big\{\sum_{Q_{X|UY}} N(Q_{UXY})\,e^{-nf(Q_{UXY})} < e^{n[s(\hat{Q}_{UY}, R_1) - \epsilon]}\Big\} \doteq 0 \tag{166}
\]
and
\[
\Pr\Big\{\sum_{Q_{X|UY}} N(Q_{UXY})\,e^{-nf(Q_{UXY})} \ge e^{n[s(\hat{Q}_{UY}, R_1) + \epsilon]}\Big\} \doteq \exp\big\{-nE^*(\hat{Q}_{UY}, s(\hat{Q}_{UY}, R_1) + \epsilon, R_2)\big\}. \tag{167}
\]
Furthermore, by using Lemma 8, we conclude that
\[
\Pr\Big\{e^{n[s(\hat{Q}_{UY}, R_1) - \epsilon]} \le \sum_{Q_{X|UY}} N(Q_{UXY})\,e^{-nf(Q_{UXY})} < e^{n[s(\hat{Q}_{UY}, R_1) + \epsilon]}\Big\} \doteq 1. \tag{168}
\]
Therefore, we have
\[
\Pr\big\{F' > (f_0 + F_1)\cdot e^{-nT}\big\} \mathrel{\dot{\le}} \sum_i \Pr\Big\{e^{ni\epsilon} \le \sum_{Q_{X|UY}} N(Q_{UXY})\,e^{-nf(Q_{UXY})} \le e^{n(i+1)\epsilon}\Big\}\times\exp\big\{-nE\big(\hat{Q}_Y, \max\{i\epsilon, -f(\hat{Q}_{UXY})\} - T, R_1, R_2\big)\big\}, \tag{169}
\]
where the expression $\max\{i\epsilon, -f(\hat{Q}_{UXY})\}$ in the argument of $E(\hat{Q}_Y, \cdot, R_1, R_2)$ is due to the fact that
\begin{align}
K &= \frac{1}{n}\ln\bigg[e^{-nf(\hat{Q}_{UXY})} + \sum_{Q_{X|UY}} N(Q_{UXY})\,e^{-nf(Q_{UXY})}\bigg] \tag{170}\\
&\ge \frac{1}{n}\ln\Big[e^{-nf(\hat{Q}_{UXY})} + e^{ni\epsilon}\Big] \tag{171}\\
&\doteq \max\{i\epsilon, -f(\hat{Q}_{UXY})\}. \tag{172}
\end{align}
By using the fact that $\epsilon$ above can be made arbitrarily small, we obtain
\[
\Pr\big\{F' > (f_0 + F_1)\cdot e^{-nT}\big\} \doteq \exp\big\{-nE\big(\hat{Q}_Y, \max\{s(\hat{Q}_{UY}, R_1), -f(\hat{Q}_{UXY})\} - T, R_1, R_2\big)\big\}, \tag{173}
\]
where (173) is due to the fact that the dominant contribution to the sum over $i$ comes from the term indexed by $i = s(\hat{Q}_{UY}, R_1)/\epsilon$. This, in turn, follows from (168) and (169), as well as the fact that $t \mapsto E(\hat{Q}_Y, t, R_1, R_2)$, as defined in (145), is non-decreasing.

Note that $s(\hat{Q}_{UY}, R_1)$ and $f(\hat{Q}_{UXY})$ are constant (given $(U^n(1), X^n(1,1), Y^n) = (u^n, x^n, y^n)$). Once again, by using the fact that the function $E(\hat{Q}_Y, t, R_1, R_2)$, as defined in (145), is non-decreasing in the parameter $t$, we have
\begin{align}
&E\big(\hat{Q}_Y, \max\{s(\hat{Q}_{UY}, R_1), -f(\hat{Q}_{UXY})\} - T, R_1, R_2\big) \notag\\
&\quad= \max\big\{E(\hat{Q}_Y, -f(\hat{Q}_{UXY}) - T, R_1, R_2),\ E(\hat{Q}_Y, s(\hat{Q}_{UY}, R_1) - T, R_1, R_2)\big\} \tag{174}\\
&\quad= \max\bigg\{\min_{Q_{UX|Y}\in\mathcal{L}(\hat{Q}_{XY}, R_1, R_2, T)} \Phi(Q_{UX|Y}\hat{Q}_Y, R_1, R_2),\ \min_{Q_{UX|Y}\in\mathcal{L}(\hat{Q}_{UXY}, R_1, R_2, T)} \Phi(Q_{UX|Y}\hat{Q}_Y, R_1, R_2)\bigg\}. \tag{175}
\end{align}
By combining (139), (173) and (175) and averaging over $(U^n(1), X^n(1,1), Y^n)$, we have
\begin{align}
\lim_{n\to\infty} -\frac{1}{n}\ln \mathbb{E}_{\mathcal{C}}[e^{\mathrm{t}}_2(1,1)] &= \lim_{n\to\infty} -\frac{1}{n}\ln \mathbb{E}_{(U^n(1), X^n(1,1), Y^n)}\big[\Pr\{\mathcal{E} \mid (U^n(1), X^n(1,1), Y^n)\}\big] \tag{176}\\
&= \max\{\Psi_a, \Psi_c\}. \tag{177}
\end{align}
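The max--min step (159)--(165) reduces the threshold computation to the scalar quantity $s(\hat{Q}_{UY}, R_1)$. The sketch below approximates it by random search. The alphabet sizes, rate, channel, and the functional forms $f(Q) = \mathbb{E}_Q[\ln(1/W_Y(Y|X))]$ and $\beta(Q, R_1) = D(Q_{X|UY}\,\|\,P_{X|U}\,|\,Q_{UY}) + I_Q(X;Y|U) - R_1$ (suggested by (146) and (199)) are all assumptions made here for illustration, not the paper's exact definitions.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
U, X, Y = 2, 2, 2                        # alphabet sizes (assumed)
R1 = 0.3                                 # private rate R_1 (assumed)
W = np.array([[0.9, 0.1],                # W[x, y] = W_Y(y | x) (assumed)
              [0.1, 0.9]])
P_XgU = np.array([[0.8, 0.2],            # P_{X|U}[u, x] (assumed)
                  [0.2, 0.8]])
Q_UY = rng.dirichlet(np.ones(U * Y)).reshape(U, Y)   # a fixed \hat{Q}_{UY}

def f_and_beta(Q_XgUY):                  # Q_XgUY[u, y, x] = Q(x | u, y)
    joint = Q_UY[:, :, None] * Q_XgUY    # joint[u, y, x]
    f = -np.sum(joint * np.log(W.T[None, :, :]))           # E_Q[ln 1/W_Y(Y|X)]
    div = np.sum(joint * np.log(Q_XgUY / P_XgU[:, None, :]))
    Q_YgU = Q_UY / Q_UY.sum(axis=1, keepdims=True)
    Q_XgU = np.einsum('uy,uyx->ux', Q_YgU, Q_XgUY)         # Q(x | u)
    mi = np.sum(joint * np.log(Q_XgUY / Q_XgU[:, None, :]))  # I_Q(X;Y|U)
    return f, div + mi - R1

# s(\hat{Q}_{UY}, R_1) = -min{ f(Q) + beta(Q, R_1) : beta(Q, R_1) <= 0 },
# approximated here by random search over conditional distributions Q_{X|UY}.
best = np.inf
for _ in range(20000):
    Q_XgUY = rng.dirichlet(np.ones(X), size=(U, Y))
    f, beta = f_and_beta(Q_XgUY)
    if beta <= 0:
        best = min(best, f + beta)
print("s approx:", -best)
\end{verbatim}

The feasible set is guaranteed non-empty by part 1 of Lemma 8 below, so the search always returns a finite value.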
Finally, the equality $E^{\mathrm{u}}_2 = E^{\mathrm{t}}_2 + T$ can be obtained from [14, Lemma 1] and the same argument as that used to justify $E^{\mathrm{u}}_1 = E^{\mathrm{t}}_1 + T$ in Theorem 1. For constant composition random codes, the result follows by the same argument as at the end of the proof of Theorem 1. This concludes the proof of Theorem 2.
Remark 1. The maximization operations in (174) and (175) lead to the somewhat unusual maximization in the error exponent $E^{\mathrm{t}}_2$ in (177), and hence in (25) in the theorem statement. We provide some intuition for it here. Recall the BSC example in Section V-A. Note that the input distribution $P_{UX}$ is given and may be chosen in a sub-optimal manner, so the regions in (55) and (56) are not capacity regions.

• If the first inner minimization in (175), pertaining to $f(\hat{Q}_{UXY})$ in (174), achieves the maximum, this corresponds to terminal $Y$ using the channel $W^n_Y(y^n \mid X^n(m_1, m_2))$ to decode the true transmitted codeword $X^n(m_1, m_2)$ and thereby find $m_2$. On the other hand, if terminal $Y$ decodes $m_2$ successfully by using this option, roughly speaking, this corresponds to the event $\{F' \le f_0\,e^{-nT}\}$ (see (140)) occurring. This case is analogous to the rate constraint $R_1 + R_2 \le I(X;Y)$ in (56).

• If the second inner minimization in (175), pertaining to $s(\hat{Q}_{UY}, R_1)$ in (174), achieves the maximum, this corresponds to $Y$ using the induced channel $\Pr(y^n \mid \mathcal{C}'(m_2))$ to decode the sub-codebook $\mathcal{C}'(m_2) = \{X^n(m_1, m_2) : m_1 \in \mathcal{M}_1 \setminus \{1\}\}$. On the other hand, if terminal $Y$ decodes the message $m_2$ successfully by using this option, roughly speaking, this means that the event $\{F' \le F_1\,e^{-nT}\}$ (see (140)) occurs. This case is analogous to the rate constraint $R_2 \le I(U;Y)$ in (56).

The union in (56) also corroborates the existence of the maximum in (175). The sketch following this remark illustrates the two constraints numerically.
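In the sketch below, the superposition input is an assumed doubly symmetric binary pair feeding a BSC (the parameters are illustrative and are not those of the example in Section V-A); it prints $I(U;Y)$ and $I(X;Y)$ in nats, so one can see which of $R_2 \le I(U;Y)$ and $R_1 + R_2 \le I(X;Y)$ binds for a given rate pair.

\begin{verbatim}
import numpy as np

def mutual_info(p):                        # I(A;B) in nats from a joint p[a, b]
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    return float(np.sum(p * np.log(p / (pa * pb))))

# Superposition input: U ~ Bern(1/2), X = U xor Bern(a); channel Y = X xor Bern(p).
a, p = 0.11, 0.05                          # assumed parameters
P_U   = np.array([0.5, 0.5])
P_XgU = np.array([[1 - a, a], [a, 1 - a]])
W     = np.array([[1 - p, p], [p, 1 - p]])   # W(y | x)

P_UX = P_U[:, None] * P_XgU
P_X  = P_UX.sum(axis=0)
P_XY = P_X[:, None] * W                    # joint of (X, Y)
P_UY = P_UX @ W                            # joint of (U, Y)

print("I(X;Y) =", mutual_info(P_XY))       # governs R1 + R2 <= I(X;Y)
print("I(U;Y) =", mutual_info(P_UY))       # governs R2 <= I(U;Y)
\end{verbatim}

Since $U \to X \to Y$ is a Markov chain, $I(U;Y) \le I(X;Y)$ always holds, and which constraint is active depends on the rate pair $(R_1, R_2)$.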
APPENDIX A: PROOF OF PROPOSITION 3

Proof: Let $Q^*_1$ and $Q^*_2$ (two distributions of the form $Q = Q_{X|UY}\hat{Q}_{UY}$) be optimal solutions to the modified and original inner optimizations of $\Psi'_{b1}$ and $\Psi_{b1}$, respectively. Assume, to the contrary, that $\beta(Q^*_2) > 0$. Moreover, note that $\beta(Q^*_1) < 0$ by the assumption. Due to the continuity of $\beta(Q)$ in $Q$, there exists a conditional probability distribution $\bar{Q}_{X|UY}$ such that $\bar{Q} = \alpha Q^*_2 + (1-\alpha)Q^*_1$, where $\bar{Q} = \bar{Q}_{X|UY}\hat{Q}_{UY}$, for some $\alpha \in (0,1)$ that satisfies $\beta(\bar{Q}) = 0$. As the first constraint in $\mathcal{L}$ is convex in $Q$, the solution $\bar{Q}$ is feasible (for $\Psi_{b1}$). Note that the optimal value of the objective function (in $\Psi_{b1}$) is $\beta(Q^*_2) > 0$ while $\beta(\bar{Q}) = 0$; this is a contradiction. Hence, there exists an optimal solution to the original inner optimization problem $\Psi_{b1}$ satisfying $\beta(Q_{X|UY}\hat{Q}_{UY}) = 0$. Moreover, this optimal solution of $\Psi_{b1}$ (i.e., $(\hat{Q}^*, Q^*)$ with $\beta(Q^*_{X|UY}\hat{Q}^*_{UY}) = 0$) is also feasible for $\Psi_{b2}$. As a result, in this case, the optimal value of $\Psi_b$ is equal to that of $\Psi_{b2}$ because $\Psi_b = \min\{\Psi_{b1}, \Psi_{b2}\}$.
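The interpolation step above is constructive: $\beta$ is continuous along the segment from $Q^*_1$ to $Q^*_2$ and changes sign, so the mixing weight $\alpha$ with $\beta(\bar{Q}) = 0$ can be located by bisection. Below is a minimal sketch with a stand-in convex functional playing the role of $\beta$ (the reference distribution and rate are arbitrary assumptions; in practice the actual $\beta$ from (13) would be used).

\begin{verbatim}
import numpy as np

def beta(q):                     # stand-in for beta(Q, R1): D(q||p) - R1 (assumed)
    p = np.array([0.5, 0.3, 0.2])            # reference distribution (assumed)
    return float(np.sum(q * np.log(q / p))) - 0.05

q1 = np.array([0.5, 0.3, 0.2])   # beta(q1) = -R1 < 0  (plays the role of Q*_1)
q2 = np.array([0.1, 0.1, 0.8])   # beta(q2) > 0        (plays the role of Q*_2)
assert beta(q1) < 0 < beta(q2)

lo, hi = 0.0, 1.0                # Qbar = alpha*q2 + (1 - alpha)*q1
for _ in range(60):              # bisection: beta is continuous in alpha
    mid = (lo + hi) / 2
    if beta(mid * q2 + (1 - mid) * q1) > 0:
        hi = mid
    else:
        lo = mid
alpha = (lo + hi) / 2
print(alpha, beta(alpha * q2 + (1 - alpha) * q1))   # beta(Qbar) is approx. 0
\end{verbatim}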
APPENDIX B: PROOF OF PROPOSITION 4

Proof: Let $Q^*_1$, $Q^*_3$ and $Q^*_5$ (three distributions of the form $Q^* = Q^*_{XU|Y}\hat{Q}_Y$) be optimal solutions to the modified and new optimizations $\Phi'_{a1}$, $\Phi'_{a3}$ and $\Phi^*_{a5}$, respectively. There are two cases for the solution $Q^*_5$, namely case (i) $\gamma(Q^*_5) \ge 0$ and case (ii) $\gamma(Q^*_5) \le 0$.

In case (i), as the solution $Q^*_5$ is also optimal for the problem $\Phi'_{a1}$, we only need to consider the solution $Q^*_3$ to the problem $\Phi'_{a3}$. Note that the convex objective functions of $\Phi'_{a3}$ and $\Phi^*_{a5}$ are the same, and the convex feasible set of $\Phi'_{a3}$ is a subset of the convex feasible set of $\Phi^*_{a5}$. Then the solution $Q^*_3$ must satisfy $\gamma(Q^*_3) = 0$ by a similar argument as that for $\Psi_{b1}$ in the proof of Proposition 3. Moreover, we may assume that this solution $Q^*_3$ is feasible for the original problem $\Phi^*_{a3}$ (if not, similarly to the discussion for $\Psi_{b1}$, we do not need to consider the term $\Phi^*_{a3}$ in (52) due to $\Phi^*_{a4}$). Hence, the optimal solution $Q^*_3$ to the problem $\Phi^*_{a3}$ satisfies $\gamma(Q^*_3) = 0$ and $\beta(Q^*_3) \ge 0$, and is also feasible for the problem $\Phi^*_{a1}$. Therefore, we can remove the term $\Phi^*_{a3}$ from the inner minimization in (52).

For case (ii), using a similar argument as above, we can show that the solution $Q^*_1$ with $\gamma(Q^*_1) = 0$ and $\beta(Q^*_1) \ge 0$ is feasible for the problem $\Phi^*_{a3}$; therefore, we can remove the term $\Phi^*_{a1}$ from the inner minimization in (52).

In summary, without loss of optimality, we can replace $\Phi^*_{a1}$ and $\Phi^*_{a3}$ with the new convex optimization problem $\Phi^*_{a5}$. Moreover, the optimal value of $\Phi^*_{a5}$ is active (see (46)) in the inner minimization problem in (52) if
\[
\big\{\gamma(Q^*_5) \ge 0 \,\cap\, \beta(Q^*_5) \ge 0\big\} \,\cup\, \big\{\gamma(Q^*_5) \le 0 \,\cap\, \gamma(Q^*_5) + \beta(Q^*_5) \ge 0\big\}. \tag{178}
\]
This completes the proof of Proposition 4.
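The convex programs manipulated in Propositions 3 and 4 can be evaluated with off-the-shelf exponential-cone solvers. The sketch below is one assumed instance, not the paper's exact program $\Phi^*_{a5}$: it minimizes $\beta(Q_{X|UY}\hat{Q}_{UY}, R_1) = D(Q_{X|UY}\|P_{X|U}|\hat{Q}_{UY}) + I_Q(X;Y|U) - R_1$ over conditional distributions via cvxpy's rel_entr atom. Consistent with (199)--(200) in Appendix D, the optimal value is $-R_1$, attained at $Q_{X|UY} = P_{X|U}$.

\begin{verbatim}
import numpy as np
import cvxpy as cp

U, X, Y, R1 = 2, 2, 2, 0.3                            # sizes and rate (assumed)
rng = np.random.default_rng(0)
P_XgU = rng.dirichlet(np.ones(X), size=U)             # P_{X|U}[u, x] (assumed)
Q_UY = rng.dirichlet(np.ones(U * Y)).reshape(U, Y)    # fixed \hat{Q}_{UY}

w = np.tile(Q_UY.reshape(-1)[:, None], (1, X))        # row r=(u,y): \hat{Q}_{UY}(u,y)
P_rows = np.repeat(P_XgU, Y, axis=0)                  # row r=(u,y): P_{X|U}(.|u)
Q_YgU = Q_UY / Q_UY.sum(axis=1, keepdims=True)
M = np.zeros((U * Y, U * Y))                          # (M @ q)[r] = Q_{X|U}(.|u(r))
for u in range(U):
    M[u * Y:(u + 1) * Y, u * Y:(u + 1) * Y] = np.tile(Q_YgU[u], (Y, 1))

q = cp.Variable((U * Y, X), nonneg=True)              # Q_{X|UY}, rows indexed by (u, y)
joint = cp.multiply(w, q)                             # \hat{Q}_{UY}(u,y) Q(x|u,y)
div = cp.sum(cp.rel_entr(joint, w * P_rows))          # D(Q_{X|UY}||P_{X|U}|\hat{Q}_{UY})
mi = cp.sum(cp.rel_entr(joint, cp.multiply(w, M @ q)))  # I_Q(X;Y|U)
prob = cp.Problem(cp.Minimize(div + mi - R1), [cp.sum(q, axis=1) == 1])
prob.solve()                                          # needs an exp-cone solver (SCS/ECOS)
print(prob.value)                                     # approx. -R1, at Q_{X|UY} = P_{X|U}
\end{verbatim}

Both terms are jointly convex (rel_entr with affine arguments), so the program is a valid disciplined convex problem.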
APPENDIX C: PROOF OF LEMMA 7

Proof: We are given $(U^n(1), X^n(1,1), Y^n) = (u^n, x^n, y^n)$, $\mathcal{C}'_U = c'_U$ and $T \ge 0$. Note also that $f_0$, defined in (61), is constant (deterministic) in this proof. In the following, we omit the dependence on the conditioning event $\{(U^n(1), X^n(1,1), Y^n) = (u^n, x^n, y^n), \mathcal{C}'_U = c'_U\}$ for notational convenience. Now we have
\begin{align}
&\Pr\big\{F_1 + F_2 > \max\{f_0, F_3\}\cdot e^{-nT}\big\} \notag\\
&\quad= \Pr\big\{F_1 + F_2 > F_3\cdot e^{-nT},\ F_3 > f_0\big\} + \Pr\{F_3 \le f_0\}\cdot\Pr\big\{F_1 + F_2 > f_0\cdot e^{-nT} \,\big|\, F_3 \le f_0\big\} \tag{179}\\
&\quad= \Pr\big\{F_1 + F_2 > F_3\cdot e^{-nT},\ F_3 > f_0\big\} + \Pr\{F_3 \le f_0\}\cdot\Pr\big\{F_1 + F_2 > f_0\cdot e^{-nT}\big\} \tag{180}\\
&\quad\doteq \Pr\big\{F_1 + F_2 > F_3\cdot e^{-nT},\ F_3 > f_0\big\} + \Pr\{F_3 \le f_0\}\cdot\Pr\big\{\max\{F_1, F_2\} > f_0\cdot e^{-nT}\big\} \tag{181}\\
&\quad\doteq \Pr\big\{F_1 + F_2 > F_3\cdot e^{-nT},\ F_3 > f_0\big\} + \Pr\{F_3 \le f_0\}\cdot\max\big\{\Pr\{F_2 > f_0\cdot e^{-nT}\},\ \Pr\{F_1 > f_0\cdot e^{-nT}\}\big\}, \tag{182}
\end{align}
where (180) is due to the fact that $(F_1, F_2)$ is independent of $F_3$ given $Y^n = y^n$ and $\mathcal{C}'_U = c'_U$ (see the definitions of $F_1$, $F_2$ and $F_3$ in (59)--(62)), (181) is due to the fact that $f_0\cdot e^{-nT}$ is exponentially small, and the interchange of $\max\{\cdot\}$ and $\Pr\{\cdot\}$ in (182) is justified similarly to [13, Eqn. (37)] and [18, Eqns. (15)--(20)].

Recall the random codebook generation with superposition structure described in Section II-B. We may rewrite $F_2$ as
\[
F_2 = \sum_{m'_1\in\mathcal{M}_1\setminus\{1,2\}}\ \sum_{m'_2\in\mathcal{M}_2\setminus\{1\}} W^n_Y(y^n \mid X^n(m'_1, m'_2)) + \sum_{m'_2\in\mathcal{M}_2\setminus\{1\}} W^n_Y(y^n \mid X^n(2, m'_2)). \tag{183}
\]
Note that all the $X^n(m_1, m_2)$ terms in $F_2$ and $F_3$ are generated in an i.i.d. manner. Hence the final term in (183), a non-negative random variable, has the same distribution as $F_3$. Since $W^n_Y(y^n \mid \cdot) \ge 0$ and $e^{-nT} \le 1$, for any given $c'_U$, we obtain
\[
\Pr\{F_2 > f_0\cdot e^{-nT}\} \ge \Pr\{F_2 > f_0\} \ge \Pr\{F_3 > f_0\}. \tag{184}
\]
The second inequality in (184) follows from the fact that if two random variables $A$ and $A'$ have the same distribution and $B$ is a non-negative random variable, then clearly $\Pr\{A + B \ge c\} \ge \Pr\{A' \ge c\}$ for all $c \in \mathbb{R}$.

Now, we focus on the sequence $\eta_n := \Pr\{F_3 > f_0\}$. Assume that the limit of $\eta_n$ exists (otherwise, we may pick a convergent subsequence and work with that subsequence in the following). There are two cases: case (i) $\lim_{n\to\infty}\eta_n = 0$ and case (ii) $\lim_{n\to\infty}\eta_n > 0$.

• For case (i), from (182), we have
\begin{align}
\Pr\big\{F_1 + F_2 > \max\{f_0, F_3\}\cdot e^{-nT}\big\}
&\doteq \Pr\big\{F_1 + F_2 > F_3\cdot e^{-nT},\ F_3 > f_0\big\} + \max\big\{\Pr\{F_2 > f_0\cdot e^{-nT}\},\ \Pr\{F_1 > f_0\cdot e^{-nT}\}\big\} \tag{185}\\
&\doteq \max\big\{\Pr\{F_1 + F_2 > F_3\cdot e^{-nT},\ F_3 > f_0\},\ \Pr\{F_2 > f_0\cdot e^{-nT}\},\ \Pr\{F_1 > f_0\cdot e^{-nT}\}\big\} \tag{186}\\
&\doteq \max\big\{\Pr\{F_2 > f_0\cdot e^{-nT}\},\ \Pr\{F_1 > f_0\cdot e^{-nT}\}\big\}, \tag{187}
\end{align}
where (185) is due to the fact that $\Pr\{F_3 \le f_0\} = 1 - \eta_n$ tends to 1, and the final step (187) is due to the fact that
\begin{align}
\Pr\big\{F_1 + F_2 > F_3\cdot e^{-nT},\ F_3 > f_0\big\} &\le \Pr\{F_3 > f_0\} \tag{188}\\
&\le \Pr\{F_2 > f_0\cdot e^{-nT}\}, \tag{189}
\end{align}
where (189) is due to (184).

• For case (ii), on the one hand, we have
\begin{align}
\Pr\big\{F_1 + F_2 > F_3\cdot e^{-nT},\ F_3 > f_0\big\} &\ge 1 - \Pr\big\{F_1 + F_2 \le F_3\cdot e^{-nT}\big\} - \Pr\{F_3 \le f_0\} \tag{190}\\
&\ge 1 - \Pr\{F_2 \le F_3\cdot e^{-nT}\} - \Pr\{F_3 \le f_0\} \tag{191}\\
&\ge \Pr\{F_3 > f_0\} - e^{-nR_1}/2 \tag{192}\\
&\doteq 1, \tag{193}
\end{align}
where (191) is due to the fact that $F_1 \ge 0$, (192) is due to Lemma 6, and (193) is due to the fact that (192) does not tend to 0 (and is obviously bounded above by 1) by the assumption that $\lim_{n\to\infty}\eta_n > 0$. From (182), we have
\[
\Pr\big\{F_1 + F_2 > \max\{f_0, F_3\}\cdot e^{-nT}\big\} \doteq 1. \tag{194}
\]
On the other hand, we have
\begin{align}
\max\big\{\Pr\{F_2 > f_0\cdot e^{-nT}\},\ \Pr\{F_1 > f_0\cdot e^{-nT}\}\big\} &\ge \Pr\{F_2 > f_0\cdot e^{-nT}\} \tag{195}\\
&\ge \Pr\{F_3 > f_0\} \tag{196}\\
&\doteq 1, \tag{197}
\end{align}
where (196) is due to (184), and (197) is due to the fact that (196) does not tend to 0 by the assumption that $\lim_{n\to\infty}\eta_n > 0$. Thus, for case (ii), combining (194) and (197), we have
\[
\Pr\big\{F_1 + F_2 > \max\{f_0, F_3\}\cdot e^{-nT}\big\} \doteq \max\big\{\Pr\{F_2 > f_0\cdot e^{-nT}\},\ \Pr\{F_1 > f_0\cdot e^{-nT}\}\big\}. \tag{198}
\]
Since in both cases we arrive at the same conclusion, namely (187) and (198), this completes the proof of Lemma 7.
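The repeated "sum of probabilities $\doteq$ maximum" steps (e.g., (181)--(182) and (185)--(187)) can also be seen on simulated data. The toy model below is purely an illustrative assumption (lognormal-type weights standing in for the likelihood sums $F_1$ and $F_2$): the exponent estimates for the sum and for the maximum track each other as $n$ grows, consistent with the sandwich $\max(P_1, P_2) \le \Pr\{F_1 + F_2 > c\} \le P_1(c/2) + P_2(c/2)$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, c = -0.5, -1.0, 1.0                       # assumed drifts; threshold c
for n in (15, 30, 60):
    g1 = rng.normal(mu1, 1 / np.sqrt(n), 400_000)   # F1 = e^{n g1}
    g2 = rng.normal(mu2, 1 / np.sqrt(n), 400_000)   # F2 = e^{n g2}
    p_sum = np.mean(np.exp(n * g1) + np.exp(n * g2) > c)
    p_max = max(np.mean(np.exp(n * g1) > c), np.mean(np.exp(n * g2) > c))
    print(n, -np.log(p_sum) / n, -np.log(p_max) / n)  # exponent estimates agree
\end{verbatim}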
APPENDIX D: PROOF OF LEMMA 8

Proof: The three parts of Lemma 8 are proved as follows.

1) Recall the definition of $\beta(Q_{UXY}, R_1)$ (see (13)). Let $Q'_{X|UY} = P_{X|U}$; then we have
\begin{align}
\beta(P_{X|U}\hat{Q}_{UY}, R_1) &= D(P_{X|U}\,\|\,P_{X|U}\mid\hat{Q}_U) + I_{P_{X|U}\hat{Q}_{UY}}(X;Y|U) - R_1 \tag{199}\\
&= -R_1 < 0. \tag{200}
\end{align}
Thus, there exists a conditional distribution $Q'_{X|UY} = P_{X|U}$ belonging to $\{Q_{X|UY} : \beta(Q_{X|UY}\hat{Q}_{UY}, R_1) \le 0\}$.

2) As $E^*(\hat{Q}_{UY}, t, R_2)$ is non-decreasing in $t$, we only need to show that $E^*(\hat{Q}_{UY}, t, R_2) = 0$ when $t = s(\hat{Q}_{UY}, R_1)$. From the conclusion above, we have $s(\hat{Q}_{UY}, R_1) > -\infty$. Assume that the optimal solution corresponding to $s(\hat{Q}_{UY}, R_1)$ is $Q^* = Q^*_{X|UY}\hat{Q}_{UY}$. Now, we take $Q_{UXY} = Q^*$ in the constraint of $\mathcal{L}(s(\hat{Q}_{UY}, R_1), R_2, R_1)$ in (111); then we have
\begin{align}
s(\hat{Q}_{UY}, R_1) + f(Q^*) - \big|R_2 - \beta(Q^*, R_1) - R_2\big|^+ &= \big[-f(Q^*) - \beta(Q^*, R_1)\big] + f(Q^*) - \big|{-\beta(Q^*, R_1)}\big|^+ \tag{201}\\
&= \big[-f(Q^*) - \beta(Q^*, R_1)\big] + f(Q^*) + \beta(Q^*, R_1) \tag{202}\\
&= 0, \tag{203}
\end{align}
where (202) is because $Q^*$ satisfies the constraint $\beta(Q^*, R_1) \le 0$ in (164). Therefore, $Q^* \in \mathcal{L}(s(\hat{Q}_{UY}, R_1), R_2, R_1)$ and $|\beta(Q^*, R_1)|^+ = 0$. Thus, $E^*(\hat{Q}_{UY}, t, R_2) = 0$ for $t \le s(\hat{Q}_{UY}, R_1)$.

3) Recall the definitions of $E(Q_{X|UY}\hat{Q}_{UY}, t, r)$ and $E^*(\hat{Q}_{UY}, t, r)$ (see (110) and (112)). We only need to show that any conditional probability distribution $Q_{X|UY}$ such that $\beta(Q_{X|UY}\hat{Q}_{UY}, R_1) \le 0$ satisfies $Q_{X|UY} \notin \mathcal{L}(t, R_2, R_1)$. Assume, to the contrary, that there exists a conditional probability distribution $\tilde{Q}_{X|UY}$ such that $\beta(\tilde{Q}, R_1) \le 0$ and $\tilde{Q} \in \mathcal{L}(t, R_2, R_1)$, where $\tilde{Q} = \tilde{Q}_{X|UY}\hat{Q}_{UY}$. Now, we have
\begin{align}
t &\le -f(\tilde{Q}) + \big|R_2 - \beta(\tilde{Q}, R_1) - R_2\big|^+ \tag{204}\\
&= -f(\tilde{Q}) - \beta(\tilde{Q}, R_1) \tag{205}\\
&\le \max_{Q:\,\beta(Q, R_1)\le 0}\big[-f(Q) - \beta(Q, R_1)\big] \tag{206}\\
&= s(\hat{Q}_{UY}, R_1), \tag{207}
\end{align}
where (204) is because $\tilde{Q} \in \mathcal{L}(t, R_2, R_1)$ and (205) is because $\beta(\tilde{Q}, R_1) \le 0$. However, $t > s(\hat{Q}_{UY}, R_1)$ as assumed in Lemma 8; this is a contradiction. Hence, $E^*(\hat{Q}_{UY}, t, R_2)$ is strictly positive for all $t > s(\hat{Q}_{UY}, R_1)$.

These justifications complete the proof of Lemma 8.
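As a sanity check on part 1, the following snippet (alphabet sizes and distributions are arbitrary assumptions) evaluates the two terms of $\beta$ at $Q_{X|UY} = P_{X|U}$ and confirms the value $-R_1$ from (199)--(200): the conditional divergence vanishes because the two conditionals coincide, and $I_Q(X;Y|U) = 0$ because $X$ is conditionally independent of $Y$ given $U$ under this choice.

\begin{verbatim}
import numpy as np

U, X, Y, R1 = 2, 2, 2, 0.3                           # assumed sizes and rate
rng = np.random.default_rng(1)
P_XgU = rng.dirichlet(np.ones(X), size=U)            # P_{X|U}[u, x]
Q_UY = rng.dirichlet(np.ones(U * Y)).reshape(U, Y)   # \hat{Q}_{UY}
Q_XgUY = np.broadcast_to(P_XgU[:, None, :], (U, Y, X))  # Q(x|u,y) = P(x|u)

joint = Q_UY[:, :, None] * Q_XgUY
div = np.sum(joint * np.log(Q_XgUY / P_XgU[:, None, :]))       # divergence term = 0
Q_YgU = Q_UY / Q_UY.sum(axis=1, keepdims=True)
Q_XgU = np.einsum('uy,uyx->ux', Q_YgU, Q_XgUY)
mi = np.sum(joint * np.log(Q_XgUY / Q_XgU[:, None, :]))        # I_Q(X;Y|U) = 0
print(div + mi - R1)                                 # prints -R1 (up to float error)
\end{verbatim}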
ACKNOWLEDGEMENTS

The authors are indebted to the associate editor Prof. Neri Merhav and the two anonymous reviewers for extremely detailed comments that have helped to improve the clarity of the paper. The authors also thank Dr. Anshoo Tandon for discussions related to the examples in Section I-D.
REFERENCES

[1] D. Cao and V. Y. F. Tan, "Exact error and erasure exponents for the asymmetric broadcast channel," in IEEE Intl. Symp. on Inf. Theory, Vail, CO, 2018, pp. 1690–1694.
[2] T. Cover, "Broadcast channels," IEEE Trans. on Inform. Theory, vol. 18, no. 1, pp. 2–14, Jan 1972.
[3] J. Körner and K. Marton, "General broadcast channels with degraded message sets," IEEE Trans. on Inform. Theory, vol. 23, no. 1, pp. 60–64, Jan 1977.
[4] J. Körner and A. Sgarro, "Universally attainable error exponents for broadcast channels with degraded message sets," IEEE Trans. on Inform. Theory, vol. 26, no. 6, pp. 670–679, Nov 1980.
[5] Y. Kaspi and N. Merhav, "Error exponents for broadcast channels with degraded message sets," IEEE Trans. on Inform. Theory, vol. 57, no. 1, pp. 101–123, Jan 2011.
[6] R. Averbuch and N. Merhav, "Exact random coding exponents and universal decoders for the asymmetric broadcast channel," IEEE Trans. on Inform. Theory, vol. 64, no. 7, pp. 5070–5086, July 2018.
[7] R. Averbuch, N. Weinberger, and N. Merhav, "Expurgated bounds for the asymmetric broadcast channel," arXiv:1711.10299, 2017.
[8] G. Forney, "Exponential error bounds for erasure, list, and decision feedback schemes," IEEE Trans. on Inform. Theory, vol. 14, no. 2, pp. 206–220, Mar 1968.
[9] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, 2011.
[10] E. Telatar, "Multi-access communications with decision feedback decoding," Ph.D. dissertation, Massachusetts Institute of Technology, 1992.
[11] P. Moulin, "A Neyman–Pearson approach to universal erasure and list decoding," IEEE Trans. on Inform. Theory, vol. 55, no. 10, pp. 4462–4478, Oct 2009.
[12] N. Merhav, "Error exponents of erasure/list decoding revisited via moments of distance enumerators," IEEE Trans. on Inform. Theory, vol. 54, no. 10, pp. 4439–4447, Oct 2008.
[13] A. Somekh-Baruch and N. Merhav, "Exact random coding exponents for erasure decoding," IEEE Trans. on Inform. Theory, vol. 57, no. 10, pp. 6444–6454, Oct 2011.
[14] W. Huleihel, N. Weinberger, and N. Merhav, "Erasure/list random coding error exponents are not universally achievable," IEEE Trans. on Inform. Theory, vol. 62, no. 10, pp. 5403–5421, Oct 2016.
[15] N. Weinberger and N. Merhav, "Simplified erasure/list decoding," IEEE Trans. on Inform. Theory, vol. 63, no. 7, pp. 4218–4239, July 2017.
[16] M. Hayashi and V. Y. F. Tan, "Asymmetric evaluations of erasure and undetected error probabilities," IEEE Trans. on Inform. Theory, vol. 61, no. 12, pp. 6560–6577, Dec 2015.
[17] V. Y. F. Tan, "Error and erasure exponents for the asymmetric broadcast channel," in IEEE Information Theory Workshop - Fall, Jeju, S. Korea, 2015, pp. 153–157.
[18] N. Merhav, "Exact random coding error exponents of optimal bin index decoding," IEEE Trans. on Inform. Theory, vol. 60, no. 10, pp. 6024–6031, Oct 2014.
[19] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, 2011.
[20] E. N. Bernard and C. Meinig, "History and future of deep-ocean tsunami measurements," in IEEE OCEANS'11 MTS/IEEE KONA, Waikoloa, HI, USA, 2011.
[21] E. Bernard and V. Titov, "Evolution of tsunami warning systems and products," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 373, p. 20140371, 2015.
[22] F. Arena and G. Pau, "An overview of vehicular communications," Entropy, vol. 11, no. 2, p. 27, Jan 2019.
[23] W. Huleihel and N. Merhav, "Random coding error exponents for the two-user interference channel," IEEE Trans. on Inform. Theory, vol. 63, no. 2, pp. 1019–1042, Feb 2017.
[24] R. H. Etkin, N. Merhav, and E. Ordentlich, "Error exponents of optimum decoding for the interference channel," IEEE Trans. on Inform. Theory, vol. 56, no. 1, pp. 40–56, Jan 2010.
[25] N. Merhav, "Relations between random coding exponents and the statistical physics of random codes," IEEE Trans. on Inform. Theory, vol. 55, no. 1, pp. 83–92, Jan 2009.
Daming Cao received the B.Eng. degree in information engineering from Southeast University, Nanjing, China, in 2013. He is currently working toward the Ph.D. degree at the School of Information Science and Engineering, Southeast University. From October 2017 to September 2018, he was a visiting student in the Department of Electrical and Computer Engineering at the National University of Singapore. His research interests include information theory, network coding, and security.
Vincent Y. F. Tan (S'07-M'11-SM'15) was born in Singapore in 1981. He is currently a Dean's Chair Associate Professor in the Department of Electrical and Computer Engineering and the Department of Mathematics at the National University of Singapore (NUS). He received the B.A. and M.Eng. degrees in Electrical and Information Sciences from Cambridge University in 2005 and the Ph.D. degree in Electrical Engineering and Computer Science (EECS) from the Massachusetts Institute of Technology (MIT) in 2011. His research interests include information theory, machine learning, and statistical signal processing.

Dr. Tan received the MIT EECS Jin-Au Kong outstanding doctoral thesis prize in 2011, the NUS Young Investigator Award in 2014, the Singapore National Research Foundation (NRF) Fellowship (Class of 2018), and the NUS Young Researcher Award in 2019. He is also an IEEE Information Theory Society Distinguished Lecturer for 2018/9. He has authored a research monograph on "Asymptotic Estimates in Information Theory with Non-Vanishing Error Probabilities".