Strong Converses Are Just Edge Removal Properties
aa r X i v : . [ c s . I T ] D ec Strong Converses Are Just Edge RemovalProperties
Oliver Kosut,
Member, IEEE and J ¨org Kliewer,
Senior Member, IEEE
Abstract
This paper explores the relationship between two ideas in network information theory: edge removaland strong converses. Edge removal properties state that if an edge of small capacity is removed froma network, the capacity region does not change too much. Strong converses state that, for rates outsidethe capacity region, the probability of error converges to 1 as the blocklength goes to infinity. Variousnotions of edge removal and strong converse are defined, depending on how edge capacity and errorprobability scale with blocklength, and relations between them are proved. Each class of strong converseimplies a specific class of edge removal. The opposite directions are proved for deterministic networks.Furthermore, a technique based on a novel, causal version of the blowing-up lemma is used to prove thatfor discrete memoryless networks, the weak edge removal property—that the capacity region changescontinuously as the capacity of an edge vanishes—is equivalent to the exponentially strong converse—that outside the capacity region, the probability of error goes to 1 exponentially fast. This result is usedto prove exponentially strong converses for several examples, including the discrete 2-user interferencechannel with strong interference, with only a small variation from traditional weak converse proofs.
Index Terms:
Strong converse, edge removal, network information theory, reduction results, blowing-up lemma.
I. I
NTRODUCTION
Consider a general network communication scenario given an arbitrary collection of sourcesand sinks connected via an arbitrary network channel. The sources are independent and each
O. Kosut is with the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85287USA (email: [email protected]).J. Kliewer is with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ07102 USA (email: [email protected]).This work was presented in part at the 2016 IEEE International Symposium on Information Theory.This material is based upon work supported by the National Science Foundation under Grant No. CCF-1439465, CCF-1440014,CNS-1526547, CCF-1453718. source is demanded by a subset of sinks, where this subset can be different for each sink. Ageneral interest in network information theory is to determine the capacity of such networks,defined as the set of achievable rates for each source. As this problem is known to be challenging,we consider the simpler problem of how the capacity of these networks change if only a singleedge is removed from the network. This problem has first been studied by [1], [2]. The authorshave shown that for acyclic noiseless networks and a variety of demand types for which the cut-set bound is tight, removing an edge of capacity δ reduces the capacity of each min-cut by at most δ in each dimension. Further, in [3] it has been shown for a noiseless multiple multicast demandthat this edge removal property also holds for generalized network sharing outer bound [4]; forthe linear programming outer bound [5], [3] shows that removing an edge of capacity δ reducesthe capacity by at most Kδ , where K depends only on the network. In addition, the existence ofthe edge removal property has for example been tied to the problem whether a network codinginstance allows a reconstruction with ǫ or zero error [6], [7], respectively. Another example isthe connection of edge removal to the equivalency between a network coding instance and acorresponding index coding problem [8]. Recently, it has been shown that for a multiple-accesschannel with a so called “cooperation facilitator” [9]–[13] the edge removal property does nothold. In particular, for this setting the authors show the surprising result that adding a smallcapacity edge can lead to a significant increase in network capacity. These results have alsobeen extended to networks with state [14] and to edges which can carry only a single bit overall times under the maximal error criterion [15]. However, despite the significant progress thathas been made to understand scenarios in which the edge removal property holds, the solutionto the general problem is open.In this work, we address the connection of edge removal to the existence of strong conversesfor networks subject to an average probability of error constraint. As far as we know, thisconnection has been explored in the literature only briefly in [16, Chap. 3, p. 48]. The strongconverse theorem states that the error probability converges to 1 for large blocklengths n if therate exceeds the capacity. This is in contrast to a weak converse which only indicates that theerror probability is bounded away from zero if we operate at a rate beyond capacity. The benefitof a strong converse is that it strengthens the interpretation of capacity as a sharp phase transitionin achievable probability of error. It also allows for the following interesting interpretation: ifa strong converse exists for a given network instance, ǫ reliable codes (i.e., codes which allowreconstruction with ǫ error) must have rate tuples within the capacity region for ǫ ∈ [0 , and large n . 
Thus, a strong converse refines a capacity (or first-order) result, which provides onlythe limiting behavior as the probability of error vanishes and the blocklength goes to infinity.However, a strong converse does not provide as much refinement as a second-order (or dispersion)result [17], which clarifies the (usually O (1 / √ n ) ) backoff from capacity for small blocklengthsand fixed probability of error. Therefore, strong converses constitute “one-and-a-half-th order”results. Strong converses have been established for numerous problems, including point-to-pointsettings, e.g., for discrete memoryless channels [18] and quantum channels [19], [20]. Recently ithas been shown that a strong converse holds for a discrete memoryless networks with tight cut-setbounds [21]. There has also been work establishing exponentially strong converses , which statethat for any rate vector outside the asymptotically-zero error capacity region, the error probabilityapproaches 1 exponentially fast. Exponentially strong converses have been considered for point-to-point channels in [22], [23], and for several network problems in [24]–[27].In the following, we categorize the notions of edge removal and strong converses into differentclasses depending on how edge capacity and error probability, resp., scale with blocklength, anddemonstrate relations between these instances. See Fig. 1 for a summary of our results. Inparticular, our contributions are as follows:1) We show that each specific class of strong converse always implies a specific class of edgeremoval. This implication holds in great generality: whether the network channel modelis deterministic or probabilistic, discrete or continuous, or even whether it has memory.2) We show that implications in the opposite direction (edge removal implies strong converse)hold in some cases. In particular, we show that each opposite direction holds for determin-istic networks. However, these opposite directions do not always hold; for example, for asimple discrete memoryless point-to-point channel, each edge removal property holds, butthe strongest form of the strong converse—the extremely strong converse —does not hold.3) We further show that for all discrete memoryless stationary networks, the exponentiallystrong converse is equivalent to the weak edge removal property. The weak edge removalproperty states that if a small edge with rate growing sublinear in the blocklength isremoved, the asymptotically-zero error capacity region does not change. The proof isbased on a novel, causal version of the blowing-up lemma [28].4) We demonstrate that for networks composed of independent point-to-point links withacyclic topology, a similar equivalence holds for weaker conditions—between the ordinarystrong converse and what we call the very weak edge removal property, wherein the edge carries an unbounded number of bits that grows very slowly with blocklength.5) These results, particularly the equivalence between weak edge removal and the expo-nentially strong converse, enable us to, without much effort, strengthen many existingcomputable outer bounds or weak converses to prove that they hold in an exponentiallystrong sense. We demonstrate this for the cut-set bound, reproducing the result of [21]to show that for rates outside the region defined by cut-set bound, the probability oferror converges to 1 exponentially fast. 
We also prove exponentially strong converses fordiscrete broadcast channels, and for the discrete 2-user interference channel with stronginterference.All the above mentioned reduction results between edge removal and strong converses revealthe surprising fact that for many cases, satisfying edge removal—a condition related only tofirst-order capacity—implies a seemingly stronger “one-and-a-half-th order” property, namelythe existence of a specific version of a strong converse indicated by the leftward arrows inFig. 1. This highlights again the power of the edge removal property.This paper is organized as follows. We first introduce the model and definitions of variousstrong converse and edge removal properties in Sec. II. After that, in Sec. III we prove that strongconverses imply edge removal properties. The opposite directions for deterministic networks isthen proven in Sec. IV. Then, in Sec. V we prove one of the main results in this paper, namelyequivalence between weak edge removal and the exponentially strong converse for discretestationary memoryless. We then show equivalence between very weak edge removal and theordinary strong converse for networks of independent point-to-point links in Sec. VI. After that,in Sec. VII we derive several applications of our results, including the cut-set bound, broadcastchannels, and interference channel. Finally, Sec. VIII offers the conclusions.II. M ODEL AND D EFINITIONS
We begin by introducing notation to be used throughout the paper. Subsequently we introduceour network model, and formally define the notions of strong converse and edge removal that willbe the main focus, while proving some simple properties of these definitions. There are numberof subtly different definitions of rate regions: we summarize them in Table I for convenience.
Notation:
For an integer k we define [1 : k ] = { , . . . , k } . All logarithms and exponentialshave base . The notation ( a n ) n represents an infinite sequence of values a n for each positiveinteger n . For sequences ( a n ) n , ( b n ) n , we write a n . = b n if log( a n ) /n and log( b n ) /n have the same limit as n → ∞ . Given two probability distributions P and Q on the same alphabet X ,the relative entropy (for discrete distributions) is given by D ( P k Q ) = X x ∈X P ( x ) log P ( x ) Q ( x ) . (1)Given conditional distributions P Y | X and Q Y | X , and marginal distribution R X , the conditionalrelative entropy is given by D ( P Y | X k Q Y | X | R X ) = X x,y R X ( x ) P Y | X ( y | x ) log P Y | X ( y | x ) Q Y | X ( y | x ) . (2)The total variational distance (for discrete distributions) is given by d TV ( P, Q ) = 12 X x ∈X | P ( x ) − Q ( x ) | . (3)The Hamming distance between two sequences x n , y n ∈ X n is denoted d H ( x n , y n ) = |{ t ∈ [1 : n ] : x t = y t }| . (4)For a set A ⊆ R n , A indicates the closure of A with respect to the Euclidean distance. Wedenote the set of nonnegative real numbers by R + . Given a vector x = ( x , . . . , x n ) ∈ R n anda scalar γ ∈ R , we denote the vector-scalar sum as x + γ = ( x + γ, . . . , x n + γ ) . (5)Given a sets A , B ⊆ R n we denote the set sum as A + B = { x + y : x ∈ A , y ∈ B} . (6) A. Network Model
We begin with a network model for an arbitrary causal network channel. Many of our resultsapply only for discrete memoryless networks or deterministic networks, but some basic resultsapply in much more generality.Consider a network consisting of d nodes, where node i ∈ [1 : d ] wishes to convey a message W i at rate R i to a set of destination nodes D i ⊆ [1 : d ] . The channel model consists of: • An input alphabet X i for each i ∈ [1 : d ] , • An output alphabet Y i for each i ∈ [1 : d ] , We assume for simplicity that at most one message originates at each node; all results can be easily generalized to thescenario in which multiple messages originate at each node.
TABLE IS
UMMARY OF CAPACITY REGION DEFINITIONS R V ( N , n, ǫ, k ) Finite blocklength rate region for network N n Blocklength ǫ Average probability of error k Number of bits carried by edge ( a, b ) in the modified network as shown in Fig. 2. If omitted thenthe network is unmodified (i.e., k = 0 ) V Set of nodes in N connected to extra nodes a and b . If omitted then V = [1 : d ] ; i.e., a and b connect to all nodes C V ( N , ( ǫ n ) n , ( k n ) n ) Asymptotic capacity region for network N ( ǫ n ) n Probability of error sequence as a function of blocklength n . If replaced by + then asymptoticallyvanishing error probability ( k n ) n Bit-capacity sequence of edge ( a, b ) as a function of blocklength n . If omitted then the network isunmodified (i.e., k n = 0 for all n ) V See above • For each time step t , a conditional probability measure P Y t ,...,Y dt | Y t − ,...,Y t − d ,X t ,...,X td . (7)Note that the channel outputs at time t depend on all previous inputs up to time t , and allprevious outputs up to time t − . Definition 1:
A network is memoryless and stationary if the probability measure in (7) canbe written as P Y t ,...,Y dt | X t ,...,X dt (8)and these distributions are the same for all t . Definition 2:
A network is deterministic if the channel outputs at time t are fixed given thechannel inputs up to time t ; i.e., the conditional probability distribution in (7) takes values onlyin { , } . Definition 3:
A network is discrete if all input and output alphabets are finite sets. For any R = ( R , . . . , R d ) ∈ R d + , an ( R , n ) code consists of: While this is technically an incorrect use of “discrete”, we use it to mean “finite alphabet” as this is the usual convention inthe literature; see for example [29, p. 39]. • For each node i ∈ [1 : d ] and time t ∈ [1 : n ] , an encoding function φ it : [1 : 2 nR i ] × Y t − i → X i , (9) • For each i, j ∈ [1 : d ] where j ∈ D i , a decoding function ψ ij : [1 : 2 nR j ] × Y nj → [1 : 2 nR i ] . (10)Assume messages W i for i = 1 , . . . , d are independent and each uniformly distributed over [1 : 2 nR i ] . The channel input from node i at time t is given by X it = φ it ( W i , Y t − i ) . For j ∈ D i ,the estimate of W i at node j is given by ˆ W ij = ψ ij ( W j , Y nj ) . We write W for the completevector of messages, and ˆ W for the complete vector of message estimates. Given an ( R , n ) code,the average probability of error is P ( n )e = P ( ˆ W = W ) (11)where ˆ W = W denotes the event that there exists a node i and a message index j such that node i decodes message j incorrectly; that is, ˆ W ij = W j for any i ∈ [1 : d ] , j ∈ D i . For blocklength n and ǫ ∈ [0 , , let R ( N , n, ǫ ) ⊆ R d + be the set of rates R for which there exists an ( R , n ) code with average probability of error at most ǫ . Given a sequence ( ǫ n ) n where ǫ n ∈ [0 , forall n ∈ N , we say a rate vector R is achievable with respect to ( ǫ n ) n if there exists an integer n such that for all n ≥ n , R ∈ R ( N , n, ǫ n ) . The capacity region C ( N , ( ǫ n ) n ) is given bythe closure of the set of all achievable rate vectors with respect to ( ǫ n ) n . Alternatively, we maydefine C ( N , ( ǫ n ) n ) = [ n ∈ N \ n ≥ n R ( N , n, ǫ n ) . (12)Throughout the paper, we use R to denote a finite blocklength region, and C to denote anasymptotic region. (Table I summarizes this notation.) Note that R ( N , n, ǫ ) is defined as afunction of the single value ǫ , whereas C ( N , ( ǫ n ) n ) is a function of the infinite sequence ( ǫ n ) n .In principle C ( N , ( ǫ n ) n ) is defined for any sequence ( ǫ n ) n . However, it will be useful to restrictourselves to sequences for which − n log(1 − ǫ n ) has a limit; the following proposition, provedin Appendix A, shows that we may do this without loss of generality for memoryless stationarynetworks. We allow for any ǫ ∈ [0 , in our definitions for maximum generality, even though ǫ = 1 is a trivial case in which the rateregion is unbounded. Proposition 1:
Let N be any memoryless stationary network. For any α > , let ( ǫ n ) n and (˜ ǫ n ) n be two sequences where α = lim inf n →∞ − n log(1 − ǫ n ) = lim inf n →∞ − n log(1 − ˜ ǫ n ) . (13)Then C ( N , ( ǫ n ) n ) = C ( N , (˜ ǫ n ) n ) .As consequence of Proposition 1, for any sequence ( ǫ n ) n where α = lim inf n →∞ − n log(1 − ǫ n ) > , C ( N , ( ǫ n ) n ) = C ( N , (1 − exp {− nα } ) n ) . Thus, it is enough to focus on sequences ( ǫ n ) n where either ǫ n = 1 − exp {− nα } for some α > , or − log(1 − ǫ n ) = o ( n ) . Note that the latterincludes any sequence converging to a constant in [0 , .For fixed ǫ , C ( N , ( ǫ ) n ) denotes the capacity region with asymptotic error probability ǫ . Withsome abuse of notation, define the usual asymptotically-zero-error capacity region as C ( N , + ) = \ ǫ> C ( N , ( ǫ ) n ) . (14)Equivalently we may write C ( N , + ) = [ ǫ n = o (1) C ( N , ( ǫ n ) n ) . (15) Remark 1:
Using average probability of error rather than maximal probability of error in ourdefinition of capacity region is not merely convenient; it is critical to many of our results. Indeed,it is illustrated in [13], [15] that edge removal characteristics are very different with maximalprobability of error rather than average, and thus the relationship between edge removal andstrong converses in the maximal probability of error context is likely to be different.We proceed to define 7 different properties: 3 notions of a strong converse and 4 notionsof the edge removal property. The relationships that we will prove among these properties areshown in Fig. 1.
B. Strong ConversesDefinition 4:
Strong converses are defined in terms of whether, for a given constant γ > and a sequence ( ǫ n ) n , C ( N , ( ǫ n ) n ) ⊆ C ( N , + ) + [0 , γ ] d . (16)We say network N satisfies: • the extremely strong converse if for all γ > , (16) holds if − log(1 − ǫ n ) = γnK , where K is a positive constant depending only on the network. ExtremelyStrongConverseExponentiallyStrongConverseStrongConverse StrongEdgeRemovalWeakEdgeRemovalVery WeakEdgeRemovalExtremelyWeak EdgeRemoval
Fig. 1. Diagram showing the relationships between various strong converses and edge removal properties. Solid black linesrepresent implications that always hold (Remarks 3 and 5, and Theorem 5). All the dashed or dotted lines hold for deterministicnetworks (Theorem 7) but do not hold in general. The red dotted line does not hold even for noisy memoryless stationarynetworks (Remark 4). The black dash-dotted line holds for discrete memoryless stationary networks (Theorem 10). The bluedashed line holds for discrete memoryless stationary networks made up of independent point-to-point links (Theorem 14), andwe conjecture that it holds for all discrete memoryless stationary networks. • the exponentially strong converse if for all γ > , (16) holds for some ( ǫ n ) n where − log(1 − ǫ n ) = Θ( n ) . • the strong converse if for all γ > , (16) holds for some ( ǫ n ) n where − log(1 − ǫ n ) → ∞ . Remark 2:
Statements similar to (16) will occur throughout this paper; this condition may bealternatively written as follows: for any R ∈ C ( N , ( ǫ n ) n ) , there exists R ′ ∈ C ( N , + ) such that R i ≤ R ′ i + γ for all i ∈ [1 : d ] . Remark 3:
One can see immediately that the strong converses are ordered by strength; i.e.,the extremely strong converse implies the exponentially strong converse, which in turn impliesthe ordinary strong converse.The following proposition gives some equivalent definitions for each of these strong converse properties. It is proved in Appendix B. Proposition 2:
1) Network N satisfies the extremely strong converse if and only if there exists a constant K depending only on N such that either of the following hold:a) For any R / ∈ C ( N , + ) , any sequence of ( R , n ) codes has probability of error ( ǫ n ) n satisfying lim inf n →∞ − n log(1 − ǫ n ) ≥ βK (17)where β is the smallest number such that R ∈ C ( N , + ) + β .b) For any sequence ( ǫ n ) n where − ǫ n . = 2 − nα , C ( N , ( ǫ n ) n ) ⊆ C ( N , + ) + [0 , Kα ] d .2) Network N satisfies the exponentially strong converse if and only if either of the followinghold:a) For all R / ∈ C ( N , + ) , any sequence of ( R , n ) codes has probability of error ap-proaching 1 exponentially fast.b) For any sequence ( ǫ n ) n for which − log(1 − ǫ n ) = o ( n ) , C ( N , ( ǫ n ) n ) ⊆ C ( N , + ) .3) Network N satisfies the strong converse if and only if any of the following hold:a) For all R / ∈ C ( N , + ) , any sequence of ( R , n ) codes has probability of error ap-proaching 1 as n → ∞ .b) For all ǫ ∈ (0 , , C ( N , ( ǫ ) n ) = C ( N , + ) .c) There exists a sequence ( ǫ n ) n where ǫ n → and C ( N , ( ǫ n ) n ) = C ( N , + ) . Remark 4:
Exponential bounds on the probability of success for rates above capacity for point-to-point channels were first considered in [22]. Later, [23] exactly characterized the optimalexponent of the success probability for rates above capacity. Similar results have been foundfor network problems in [24]–[27]. For point-to-point channels, [23] showed that for a discrete-memoryless point-to-point channel P Y | X with capacity C , for all R > C the optimal probabilityof error ǫ n satisfies − ǫ n . = 2 − α ( R ) n where α ( R ) = min Q X,Y h D (cid:0) Q Y | X k P Y | X | Q X (cid:1) + | R − I Q X,Y ( X ; Y ) | + i (18)where Q X and Q Y | X are the marginal and conditional distributions derived from Q X,Y respec-tively, I Q X,Y ( X ; Y ) is the mutual information between X and Y where ( X, Y ) ∼ Q X,Y , and |·| + represents the positive part. Intuitively, Q Y | X represents an empirical conditional distribution;correct decoding is possible if the channel behaves like one with capacity greater than R (i.e. when the second term in (18) is zero), and the first term in (18) is the exponential rate of theprobability that channel P Y | X behaves like Q Y | X with input distribution Q X .This result constitutes an exponentially strong converse in our terminology, since α ( R ) > for all R > C , but interestingly it is not an extremely strong converse for many noisy channels.Note that an extremely strong converse is equivalent to dα ( R ) dR (cid:12)(cid:12) R = C > . However, as we show inthe following proposition (proved in Appendix C) this holds only for very specialized channels. Proposition 3:
Consider a discrete-memoryless point-to-point channel P Y | X with capacity C .Let P Y be the (unique) capacity-achieving output distribution. If log P Y | X ( y | x ) P Y ( y ) ≤ C for all x, y (19)then α ( R ) = R − C . Otherwise, dα ( R ) dR (cid:12)(cid:12) R = C = 0 .Examples of point-to-point channels that satisfy (19) include: • essentially noiseless channels, i.e., where C = log min {|X | , |Y |} , • completely noisy channels, i.e., where Y is independent of X , • noisy typewriter channels, i.e., where Y = X + Z with summation over some group G ,where Z is uniform on a subset of G and independent of X .Note also that (19) implies that the channel dispersion is 0 (cf. [17, Thm. 49]), but the converse isnot true. In particular, the channel dispersion is 0 if and only if there exists a capacity-achievinginput distribution P X such that log P Y | X ( y | x ) P Y ( y ) ≤ C for all y and all x with P X ( x ) > . However,(19) can fail to hold if log P Y | X ( y | x ) P Y ( y ) > C for some pair x, y even if P X ( x ) = 0 for all capacity-achieving input distributions P X . (For example, this is the case for channels termed exotic in[17].)However, most channels of interest do not satisfy (19), including binary symmetric channelsand binary erasure channels. Thus, while we are able to show equivalence between the extremelystrong converse and the strong edge removal property for deterministic networks (see Fig. 1),this equivalence cannot hold for many noisy networks, as the extremely strong converse simplydoes not hold. Fig. 2. The modified network for edge removal properties. Nodes a and b are connected to nodes in V (usually V is the set ofall nodes) by infinite capacity links, while the link between them is limited to only k bits. Edge removal properties hold whenthe capacity region of this network is unchanged when the link between a and b is removed. C. Edge Removal Properties
For a subset of nodes
V ⊆ [1 : d ] and an integer k , we define a modified network N ( V , k ) ,illustrated in Fig. 2, as follows: Start with N , and add two nodes denoted a and b . For eachnode i ∈ V , add an infinite capacity link from i to a , and an infinite capacity link from b to i .Finally, add a bit-pipe from a to b that can noiselessly transmit k bits total across the n -lengthcoding block. In the case that k is not an integer multiple of n , this bit-pipe cannot be modeledas a stationary memoryless channel. Instead, we assume that the k bits are scheduled such thatafter t timesteps, ⌊ kn t ⌋ have been transmitted; that is, at time t , the link is allowed to transmitexactly (cid:22) kn t (cid:23) − (cid:22) kn ( t − (cid:23) (20)bits. Let R V ( N , n, ǫ, k ) be the set of rate vectors R such that there exists an ( R , n ) code on N ( V , k ) with average probability at most ǫ . That is, R V ( N , n, ǫ, k ) = R ( N ( V , k ) , n, ǫ ) . Givensequences ( ǫ n ) n and ( k n ) n where ǫ n ∈ [0 , and k n ∈ N , we define C V ( N , ( ǫ n ) n , ( k n ) n ) to These are special nodes in that messages do not originate at them. Thus the capacity region of N ( V , k ) has the samedimension as that of N . One could imagine other models, such as where the bit transmission schedule is flexible but chosen in advance by the code,or where the schedule can be chosen at run-time. These model variations are unlikely to impact results, but here we adopt themore restrictive model. be the capacity region of the sequence of networks ( N ( V , k n )) n where ( k n ) n determines thedependence between the capacity of the edge ( a, b ) and the blocklength. Formally, we define C V ( N , ( ǫ n ) n , ( k n ) n ) = [ n ∈ N \ n ≥ n R V ( N , n, ǫ n , k n ) . (21)For the most part we are interested in the case that V = [1 : d ] , so we define for conve-nience R ( N , n, ǫ, k ) = R [1: d ] ( N , n, ǫ, k ) and C ( N , ( ǫ n ) n , ( k n ) n ) = C [1: d ] ( N , ( ǫ n ) n , ( k n ) n ) . Wefurther define C V ( N , + , ( k n ) n ) and C ( N , + , ( k n ) n ) analogously to (14)–(15). For any ( k n ) n ,it is certainly true that C ( N , ( ǫ n ) n ) ⊆ C ( N , ( ǫ n ) n , ( k n ) n ) . Note also that C ( N , ( ǫ n ) n , (0) n ) = C ( N , ( ǫ n ) n ) . Roughly, edge removal properties state that for small k , the capacity of network N ( V , k ) isnot too different from that of N . To be precise, we define four different versions of this propertyas follows. Definition 5:
Edge removal properties are defined in terms of whether, for a given constant γ > and a sequence ( k n ) n , C ( N , + , ( k n ) n ) ⊆ C ( N , + ) + [0 , γ ] d . (22)We say network N satisfies: • the strong edge removal property if for all γ > , (22) holds for k n = γnK , where K is apositive constant depending only on the network. • the weak edge removal property if for all γ > , (22) holds for some k n = Θ( n ) . • the very weak edge removal property if for all γ > , (22) holds for some k n → ∞ . • the extremely weak edge removal property if for all γ > , (22) holds for all bounded k n . Remark 5:
One can again see immediately that the edge removal properties are orderedby strength; i.e., the strong property implies the weak property, which implies the very weakproperty, which implies the extremely weak property.The following proposition gives several alternative definitions of each of the edge removalproperties. It is proved in Appendix D.
Proposition 4:
1) The strong edge removal property holds if and only if there exists a finite positive constant K depending only on the network N such that for all δ > , C ( N , + , ( δn ) n ) ⊆ C ( N , + ) + [0 , Kδ ] d . (23)
2) The weak edge removal property holds if and only if, \ δ> C ( N , + , ( δn ) n ) = C ( N , + ) (24)and also if and only if [ k n = o ( n ) C ( N , + , ( k n ) n ) = C ( N , + ) . (25)3) The very weak edge removal property holds if and only if \ k n : k n →∞ C ( N , + , ( k n ) n ) = C ( N , + ) (26)and also if and only if \ ǫ> [ k ∈ N C ( N , ( ǫ ) n , ( k ) n ) = C ( N , + ) . (27)4) The extremely weak edge removal property holds if and only if [ k ∈ N C ( N , + , ( k ) n ) = C ( N , + ) . (28) Remark 6:
Most works on the edge removal problem (e.g., [1], [2]) consider removing anarbitrary edge from the network, rather than the specific topology shown in Fig. 2. Most similarto this topology is the notion of a super-source network in [30], which was defined for sourcecoding problems as a network containing a node that can view all sources, and has links to eachother node. Another similar notion from the literature is that of the cooperation facilitator [9]–[14], which connects to the transmitting nodes (but not the receiving node) in a multiple-accessnetwork. We choose the topology in Fig. 2 because it ensures that the link that is added/removed isat least as useful as any other link. That is, when V = [1 : d ] , then node a has complete knowledgeof every signal sent in the network, so the link ( a, b ) can be used to simulate any other small-capacity link. In particular, for any network N ′ consisting of N supplemented by a link (ormultiple links) with total capacity at most k n bits, then C ( N ′ , ( ǫ n ) n ) ⊆ C ( N , ( ǫ n ) n , ( k n ) n ) . Oneexample of such a network N ′ is one that allows for rate-limited feedback. For this reason, oneconsequence of edge removal results are outer bounds on networks with rate-limited feedback. Remark 7:
The extremely weak edge removal property, wherein the extra edge carries abounded number of bits as the blocklength grows, appears in none of our results provingrelationships to strong converses. Nevertheless, we have chosen to include this definition becauseit is a natural one, and indeed the property seems tantalizingly likely to be true for all realistic systems. However, it was shown in [15] that for maximal error probability, there exists a networkwhere the extremely weak property does not hold. This again points to the contrast betweenaverage and maximal error probability. In light of our other results, the extremely weak propertyalso presents an interesting question: namely, is it equivalent to some version of a strong converse?Based on our results that for some networks, the very weak edge removal property is equivalentto the ordinary strong converse, if there is an equivalent converse to the extremely weak property,it appears that it would need to be weaker than the ordinary strong converse, but perhaps strongerthan the ordinary weak converse. No such property has occurred to us.III. D ERIVING E DGE R EMOVAL P ROPERTIES FROM S TRONG C ONVERSES
The following theorem states that each of the three strong converse properties implies one ofthe edge removal properties. This result holds for any causal network channel given by (7).
Theorem 5:
For any network N , the following hold:1) The strong converse implies very weak edge removal.2) The exponentially strong converse implies weak edge removal.3) The extremely strong converse implies strong edge removal.Statement (2) of this theorem was proved for noiseless networks in [16, Sec. 3.3]. Our proofuses essentially the same principle as theirs, namely converting a code on a network with anextra edge to a code on a network without one by fixing a value sent along this edge, andassuming at all other nodes that this value was sent. The following lemma provides a refinedversion of this argument, relating the achievable rate regions for the network with and withoutthe extra edge at finite blocklengths. Lemma 6:
For any integers n and k and any ǫ ∈ [0 , , R ( N , n, ǫ, k ) ⊆ R ( N , n, − (1 − ǫ )2 − k ) . (29) Proof:
Let R ∈ R ( N , n, ǫ, k ) , so there is an n -length code with rate vector R and probabilityof error at most ǫ on network N ([1 : d ] , k ) . We convert this code to one on network N as follows.Under the code on N ([1 : d ] , k ) , let X ab be the message sent on the link from node a to node b .Recall that X ab ∈ { , } k . Let E be the overall error event for network N ([1 : d ] , k ) . We have − ǫ ≤ P ( E c ) = X x ab ∈{ , } k P ( X ab = x ab ) P ( E c | X ab = x ab ) . (30) There must be some x ∗ ab ∈ { , } k for which P ( X ab = x ∗ ab ) P ( E c | X ab = x ∗ ab ) ≥ (1 − ǫ )2 − k . (31)Construct a code for network N that behaves exactly like the original code on network N ([1 : d ] , k ) , except that all nodes assume that node b received the signal x ∗ ab . Let P e be the probabilityof error for this code. Note that with probability P ( X ab = x ∗ ab ) , the code’s behavior will be justas if the code on N ([1 : d ] , k ) were in effect. Thus − P e ≥ P ( X ab = x ∗ ab ) P ( E c | X ab = x ∗ ab ) ≥ (1 − ǫ )2 − k . (32)Therefore R ∈ R ( N , n, − (1 − ǫ )2 − k ) . Proof of Theorem 5:
We first show statement (1). Assume the strong converse holds. Thus \ ǫ> [ k ∈ N C ( N , ( ǫ ) n , ( k ) n ) ⊆ \ ǫ ∈ (0 , [ k ∈ N C ( N , (1 − (1 − ǫ )2 − k ) n ) (33) = \ ǫ> [ k ∈ N C ( N , + ) (34) = C ( N , + ) (35)where (33) follows from Lemma 6; (34) follows from the strong converse, because − (1 − ǫ )2 − k ∈ (0 , for any ǫ ∈ (0 , and k ∈ N ; and (35) follows because C ( N , + ) is closed.Therefore, very weak edge removal holds by the equivalent definition in (27) of Proposition 4.We now prove statement (2). Assume the exponentially strong converse holds. For any k n = o ( n ) , we have C ( N , + , ( k n ) n ) = \ ǫ> C ( N , ( ǫ ) n , ( k n ) n ) ⊆ \ ǫ> C ( N , (1 − (1 − ǫ )2 − k n ) n ) (36) ⊆ [ ǫ n : − log(1 − ǫ n )= o ( n ) C ( N , ( ǫ n ) n ) (37) ⊆ C ( N , + ) (38)where (36) follows from Lemma 6, (37) from the fact that k n = o ( n ) , and (38) from theexponentially strong converse. Therefore weak edge removal holds.We now prove statement (3). Assume the extremely strong converse holds. For any δ > wehave C ( N , + , ( δn ) n ) = \ ǫ> C ( N , ( ǫ ) n , ( δn ) n ) ⊆ \ ǫ> C ( N , (1 − (1 − ǫ )2 − δn ) n ) (39)where (39) follows from Lemma 6. Note that (1 − ǫ )2 − δn . = 2 − δn . Thus if R ∈ C ( N , + , δn ) ,then, by the extremely strong converse, R − Kδ ∈ C ( N , + ) for some constant K . Thereforestrong edge removal holds. IV. D ETERMINISTIC N ETWORKS
The following theorem states that for deterministic networks, each implication of Theorem 5is also an equivalence.
Theorem 7:
For any deterministic network N , the following hold:1) The very weak edge removal property holds if and only if the strong converse holds.2) The weak edge removal property holds if and only if the exponentially strong converseholds.3) The strong edge removal property holds if and only if the extremely strong converse holds.To prove Theorem 7, we begin with several lemmas. The first is the well-known reverseMarkov inequality, which will be instrumental in proving that edge removal properties implystrong converses. Lemma 8:
Let X be a real-valued random variable where X ≤ x max a.s. For any τ ≤ E X , P ( X > τ ) ≥ E X − τx max − τ . (40)The following lemma provides the core result that is needed to prove Theorem 7. The proofis adapted from that of [31, Lemma 2]. Lemma 9:
Let N be a deterministic network. For any ǫ ∈ [0 , , any n ∈ N , and any ˜ ǫ ∈ (0 , , R ( N , n, ǫ ) ⊆ R ( N , n, ˜ ǫ, η (˜ ǫ, d ) − d log(1 − ǫ )) (41)where η (˜ ǫ, d ) = 3 d ( d + 1) + 3 d log ln 4 d ˜ ǫ . (42) Proof:
Let R ∈ R ( N , n, ǫ ) . That is, there exists a code with rate vector R and blocklength n achieving probability of error ǫ . The key to the proof is to show that if the rates are reducedslightly from those in R , then an extra edge allows achieving arbitrarily small probability of error. In particular, given a target probability of error ˜ ǫ , define a rate vector ˜ R = ( ˜ R , . . . , ˜ R d ) given by ˜ R i = R i − kn , R i ≥ kn , R i < kn (43)where we choose with hindsight (recall d is the number of messages in the network) k = (cid:24) d + log ln 4 d ˜ ǫ − log(1 − ǫ ) (cid:25) . (44)We will proceed prove that ˜ R ∈ R ( N , n, ˜ ǫ, dk ) (45)by constructing a code of rate ˜ R on network N ([1 : d ] , dk ) . However, to prove the lemma weneed to show that R , rather than ˜ R , is contained in the right-hand side (RHS) of (41). Given(45) and that nR i − n ˜ R i ≤ k , we may simply expand the edge from node a to b to carry dk additional bits, adding k bits for each message, which implies R ∈ R ( N , n, ˜ ǫ, dk ) . (46)This is now enough to prove the lemma, since dk ≤ η (˜ ǫ, d ) − d log(1 − ǫ ) where η (˜ ǫ, d ) isdefined in (42).We now prove (45). For i = 1 , . . . , d , let W i = [2 nR i ] be the message set for the i th messageof the original code of rate R and probability of error ǫ , and let W = d Y i =1 W i (47)be the set of complete message vectors w = ( w , . . . , w d ) . Let R = P i R i , so |W| = 2 nR .Since the network is deterministic and the code is fixed, whether or not an error occurs dependsentirely on the message vector w ∈ W that is chosen. Let Γ be the subset of W of messagevectors that do not lead to errors. Thus the probability of error is precisely − − nR | Γ | . By theassumption that the probability of error is at most ǫ , we have that | Γ | ≥ |W| (1 − ǫ ) = 2 nR (1 − ǫ ) . (48)Recall that ˜ R i = 0 if nR i < k , so this message is not significant. For ease of notation, weassume for now that nR i ≥ k for all messages i , so that ˜ R i = R i − kn . We employ a versionof a random binning argument. For each i , randomly choose the sets P i (1) , . . . , P i (2 n ˜ R i ) (49) to be a partition of W i where |P i ( ˜ w i ) | = 2 k for all ˜ w i ∈ [1 : 2 n ˜ R i ] , such that all such partitionsare equally likely. Furthermore, let P ( ˜ w ) for ˜ w = ( ˜ w , . . . , ˜ w d ) be the set of message vectors w ∈ W such that w i ∈ P i ( ˜ w i ) for all i ∈ [1 : d ] . Given these partitions, the code proceeds asfollows. Messages ˜ W , . . . , ˜ W d are all transmitted to node a . Node a then chooses a messagevector W = ( W , . . . , W d ) from the set Γ ∩ P ( ˜ W ) in an arbitrary manner. If this set is empty,then we declare an error. For each i , let I i ∈ { , . . . , k } be the index of W i in the set P i ( ˜ W i ) .Node a determines I i for each i and transmits ( I , . . . , I d ) to node b . Note that the number ofbits required is dk .At the originating source node for message i , W i can be determined from ˜ W i and I i . Subse-quently, the code proceeds as if W were the true message vector. When a destination node j produces a message estimate ˆ W ij , it constructs the final message estimate as the c ˜ W ij ∈ [1 : 2 n ˜ R i ] such that ˆ W ij ∈ P i (cid:16)c ˜ W ij (cid:17) . Since by assumption W ∈ Γ , there is no error as long as Γ ∩ P ( ˜ W ) is not empty.For ˜ w = ( ˜ w , . . . , ˜ w d ) let q ( ˜ w ) , P (Γ ∩ P ( ˜ w ) = ∅ ) (50)where the probability is with respect to the random choice of partitions P i . We proceed to showthat q ( ˜ w ) ≤ ˜ ǫ for all ˜ w . 
Thus, the probability of error averaged over both the message vector W and the random choice of partitions is at most ˜ ǫ . This proves that there exists at least onedeterministic code with average probability of error ˜ ǫ .For each i ∈ [1 : d − , define for all w , . . . , w i − , the set A i ( w , . . . , w i − ) = n w i : |{ ( w i +1 , . . . , w d ) : ( w , . . . , w d ) ∈ Γ }| ≥ (1 − ǫ )2 n ( R i +1 + ··· + R d ) − i o . (51)Moreover, define A d ( w , . . . , w d − ) = { w d : ( w , . . . , w d ) ∈ Γ } . (52)We claim that for all i ∈ [1 : d ] , if w , . . . , w i − is such that w i − ∈ A i − ( w , . . . , w i − ) , then |A i ( w , . . . , w i − ) | ≥ (1 − ǫ )2 nR i − i . (53)To prove this for i ∈ [1 : d − , assume w i − ∈ A i − ( w , . . . , w i − ) . Define the random variable X ( w , . . . , w i − ) = |{ ( w i +1 , . . . , w d ) : ( w , . . . , w i − , W i , w i +1 , . . . , w d ) ∈ Γ }| . (54) where as usual W i is uniformly distributed on [1 : 2 nR i ] . Note that E X ( w , . . . , w i − ) = 2 − nR i X w i |{ ( w i +1 , . . . , w d ) : ( w , . . . , w d ) ∈ Γ }| (55) = 2 − nR i |{ ( w i , . . . , w d ) : ( w , . . . , w d ) ∈ Γ }| (56) ≥ (1 − ǫ )2 n ( R i +1 + ··· + R d ) − ( i − (57)where the inequality follows from the assumption that w i − ∈ A i − ( w , . . . , w i − ) . Hence |A i ( w , . . . , w i − ) | = 2 nR i P (cid:16) X ( w , . . . , w i − ) ≥ (1 − ǫ )2 n ( R i +1 + ··· + R d ) − i (cid:17) (58) ≥ nR i E X ( w , . . . , w i − ) − (1 − ǫ )2 n ( R i +1 + ··· + R d ) − i n ( R i +1 + ··· + R d ) − (1 − ǫ )2 n ( R i +1 + ··· + R d ) − i (59) ≥ nR i (1 − ǫ )2 n ( R i +1 + ··· + R d ) − i n ( R i +1 + ··· + R d ) (60) = (1 − ǫ )2 nR i − i (61)where (59) follows from Lemma 8 and the fact that X ( · ) ≤ n ( R i +1 + ··· + R d ) , and (60) followsfrom (57). This proves (53) for i ∈ [1 : d − . For i = d , note that if w d − ∈ A d − ( w , . . . , w d − ) ,then by the definitions of A d − and A d , |A d ( w , . . . , w d − ) | = |{ w d : ( w , . . . , w d ) ∈ Γ }| ≥ (1 − ǫ )2 nR d − ( d − > (1 − ǫ )2 nR d − d . (62)This proves (53) for i = d .Fix ˜ w = ( ˜ w , . . . , ˜ w d ) . For each i = 1 , . . . , d , define Q i = { ( w , . . . , w i ) : w j ∈ P j ( ˜ w j ) ∩ A j ( w , . . . , w j − ) for all j ≤ i } . (63)Note that for w ∈ Q d , certainly w i ∈ P i ( ˜ w i ) for all i ∈ [1 : d ] , so w ∈ P ( ˜ w ) . Moreover, since w d ∈ A d ( w , . . . , w d − ) , by definition w ∈ Γ . Thus Q d ⊆ Γ ∩ P ( ˜ w ) , so q ( ˜ w ) ≤ P ( Q d = ∅ ) ≤ d X i =1 P ( Q i = ∅|Q i − = ∅ ) . (64)To upper bound P ( Q i = ∅|Q i − = ∅ ) , suppose Q i − = ∅ , so there exists some ( w , . . . , w i − ) ∈Q i − . If Q i is empty, then P i ( ˜ w i ) ∩ A i ( w , . . . , w i − ) = ∅ . Recall that P i ( ˜ w i ) is one set of arandom partition of W i , which is chosen independently of w , . . . , w i − . In particular, P i ( ˜ w i ) ischosen uniformly among all subsets of W i = [1 : 2 nR i ] of size k , so P ( P i ( ˜ w i ) ∩ A i ( w , . . . , w i − ) = ∅ ) = (cid:0) nRi −|A i ( w ,...,w i − ) | k (cid:1)(cid:0) nRi k (cid:1) . (65) Since by assumption ( w , . . . , w i − ) ∈ Q i − , we have w i − ∈ A i − ( w , . . . , w i − ) , so we mayapply (53) to bound P ( Q i = ∅|Q i − = ∅ ) ≤ (cid:0) nRi − (1 − ǫ )2 nRi − i k (cid:1)(cid:0) nRi k (cid:1) . (66)Thus q ( ˜ w ) ≤ d X i =1 (cid:0) nRi − (1 − ǫ )2 nRi − i k (cid:1)(cid:0) nRi k (cid:1) (67) = d X i =1 (2 nR i − (1 − ǫ )2 nR i − i )!(2 nR i − (1 − ǫ )2 nR i − i − k )! · (2 nR i − k )!(2 nR i )! 
(68) ≤ d X i =1 (2 nR i − (1 − ǫ )2 nR i − i ) k (2 nR i − k ) k (69) = d X i =1 (1 − (1 − ǫ )2 − i ) k (1 − k − nR i ) k (70) ≤ d X i =1 e − (1 − ǫ )2 k − d (1 − k − nR i ) k (71) ≤ d X i =1 ˜ ǫ d (1 − k − nR i ) − k (72) ≤ d X i =1 ˜ ǫ d (1 − − k ) − k (73) ≤ ˜ ǫ (74)where (69) follows since a ! /b ! ≤ a a − b for integers a, b , (71) follows since (1 + k ) ≤ e x , (72)follows from the choice of k in (44), (73) follows by the assumption that R i ≥ kn for all i , and(74) follows since (1 − − k ) − k ≤ for any k ≥ . This last fact can be seen by noting that f ( x ) = − x ln(1 − x − ) is decreasing in x , which holds because its derivative is given by f ′ ( x ) = − ln(1 − x − ) − x − (cid:18) x − (cid:19) − x − ≤ . (75) Proof of Theorem 7:
Theorem 5 proves that each strong converse property implies thecorresponding edge removal property, so we only need to prove the opposite directions.Suppose the very weak edge removal property holds. For any constant ǫ , applying Lemma 9gives C ( N , ( ǫ ) n ) ⊆ \ ˜ ǫ> C ( N , (˜ ǫ ) n , ( η (˜ ǫ, d ) − d log(1 − ǫ )) n ) (76) ⊆ \ ˜ ǫ> [ k ∈ N C ( N , (˜ ǫ ) n , ( k ) n ) . (77) = C ( N , + ) (78)where the last equality holds by very weak edge removal. Therefore the strong converse holds.Now suppose the weak edge removal property holds. For any sequence ( ǫ n ) n where − log(1 − ǫ n ) = o ( n ) , applying Lemma 9 gives C ( N , ( ǫ n ) n ) ⊆ \ ˜ ǫ> C ( N , (˜ ǫ ) n , ( η (˜ ǫ, d ) − d log(1 − ǫ n )) n ) (79) ⊆ \ ˜ ǫ> C ( N , (˜ ǫ ) n , ( √ n − d log(1 − ǫ n )) n ) (80) = C ( N , + , ( √ n − d log(1 − ǫ n )) n ) (81) = C ( N , + ) (82)where (80) follows since for any ˜ ǫ and d , η (˜ ǫ, d ) ≤ √ n for sufficiently large n ; and (82) followsfrom weak edge removal, since √ n − d log(1 − ǫ n ) = o ( n ) . Therefore the exponentially strongconverse holds.Finally, suppose the strong edge removal property holds. For any α > , let ǫ n where − ǫ n . =2 − nα . Applying Lemma 9 gives C ( N , ( ǫ n ) n ) = C ( N , (1 − − nα ) n ) (83) ⊆ \ ˜ ǫ> C ( N , (˜ ǫ ) n , ( η (˜ ǫ, d ) + 3 dαn ) n ) (84) ⊆ \ ˜ ǫ> C ( N , (˜ ǫ ) n , ((3 d + 1) αn ) n ) (85) = C ( N , + , ((3 d + 1) αn ) n ) (86) ⊆ C ( N , + ) + [0 , K (3 d + 1) α ] d (87)where (83) follows from Prop. 1, (84) follows from Lemma 9, (85) follows because η (˜ ǫ, d ) ≤ αn for sufficiently large n , (86) follows by the definition of C ( N , + , ( k n ) n ) , and (87) follows bythe equivalent form of the strong edge removal property in (23), where K is a finite positiveconstant depending only on the network. Therefore, this network satisfies equivalent form of theextremely strong converse in Prop. 2 part (1b). V. D
ISCRETE S TATIONARY M EMORYLESS N ETWORKS
The following is our main theorem for discrete stationary memoryless networks, connectingthe exponentially strong converse to the weak edge removal property. In addition, we show thatboth these properties are equivalent to an even weaker form of the weak edge removal property—namely, where the nodes a and b connect only to transmitting nodes ; i.e. those nodes i where X i = ∅ . (Recall the definition C V ( N , ( ǫ n ) n , ( k n ) n ) being the capacity region of the networkwith nodes a and b connected only to nodes in V .) This is a generalization of the “cooperationfacilitator” model from [9]–[14], which connected only to the transmitters in a multiple-accesschannel, but not the receiver. The intuition behind connecting only to transmitting nodes is thatthe extra edge is useful when encoding but not decoding . The reason is that when decoding,a node attempts to reconstruct a message, which is available exactly at the message’s sourcenode. Thus, any small amount of information sent from the omniscient node a could equallywell be sent from the source node. However, when encoding, the “ideal” transmission may be afunction of multiple messages, which are simultaneously available only at the ominscient node a . Therefore, even a small capacity link from a to b could in principle provide significant rategain by connecting to an encoding node. However, if a node does not transmit, it only decodesand never encodes, so the connection from nodes a and b is not helpful. Theorem 10:
For any discrete stationary memoryless network N , the following three statementsare equivalent:1) The exponentially strong converse holds.2) The weak edge removal property holds.3) For all γ > , C V ( N , + , ( k n ) n ) ⊆ C ( N , + ) + [0 , γ ] d (88)for some sequence k n = Θ( n ) , where V is the set of nodes i such that X i = ∅ .Observe that statement 1 of the theorem implies statement 2 by Theorem 5. Note that statement3 is identical to the definition of the weak edge removal, except that the left-hand side (LHS)of (88) is C V ( N , + , ( k n ) n ) instead of C ( N , + , ( k n ) n ) as in (22); i.e., in the modified network,nodes a and b connect only to the set V of transmitting nodes rather than all nodes. Since for any V ⊆ [1 : d ] , C V ( N , + , ( k n ) n ) ⊆ C ( N , + , ( k n ) n ) , statement 2 of the theorem implies statement3. Hence it remains only to show that statement 3 implies statement 1. The main tool in doing sowill be a modified version of the blowing-up lemma. The blowing-up lemma, originally proved in [32] (see also [28], [33]), has been used in the proof of numerous strong converse results. Insome sense our result is a generalization of this technique. The traditional blowing-up lemma isstated as follows. Lemma 11:
Let X n ∈ X n be a sequence of independent random variables. Fix A ⊆ X n where P X n ( A ) = exp {− nγ n } for a sequence γ n → . For any ℓ , define the blown-up version of A as A ℓ = { x n : d H ( x n , y n ) ≤ ℓ for some y n ∈ A} (89)where d H is the Hamming distance. There exists a sequence δ n → where P X n ( A nδ n ) → . (90)The following is a causal version of the blowing-up lemma. It is stronger than the usualblowing-up lemma, but it follows from a slight modification of Marton’s proof of the blowing-uplemma in [28]. One may view this lemma as a causal version of a transportation-cost inequality[33]. Lemma 12:
Let X n ∈ X n be a random sequence, not necessarily independent. Fix A ⊆ X n .There exists a sequence of conditional distributions P Z t | Y t ,Z t − for t = 1 , . . . , n such that, if welet Y n ∈ X n , Z n ∈ X n have joint distribution P Y n ,Z n ( y n , z n ) = n Y t =1 P X t | X t − ( y t | z t − ) P Z t | Y t ,Z t − ( z t | y t , z t − ) (91)then Z n ∈ A almost surely, and E d H ( Y n , Z n ) ≤ s n e log 1 P X n ( A ) . (92) Proof:
Let ˜ X n be a random sequence with distribution that of X n conditioned on the set A . That is, P ˜ X n ( x n ) = P Xn ( x n ) P Xn ( A ) x n ∈ A x n / ∈ A . (93)For any t ∈ [1 : n ] and z t − ∈ X t − , by [34, Theorem 1] there exists a pair of random variables X t ( z t − ) , ˜ X t ( z t − ) with joint distribution P X t ( z t − ) , ˜ X t ( z t − ) such that the marginal distributionssatisfy P X t ( z t − ) = P X t | X t − = z t − , (94) P ˜ X t ( z t − ) = P ˜ X t | ˜ X t − = z t − (95) and their joint distribution satisfies P ( X t ( z t − ) = ˜ X t ( z t − )) = d TV (cid:0) P X t | X t − = z t − , P ˜ X t | ˜ X t − = z t − (cid:1) . (96)We now define P Z t | Y t ,Z t − ( z t | y t , z t − ) = P ˜ X t ( z t − ) | X t ( z t − ) ( z t | y t ) . (97)Let Y n , Z n have distribution given by (91), where P Z t | Y t ,Z t − is defined in (97). Note that P Y t ,Z t | Z t − ( y t , z t | z t − ) = P X t | X t − ( y t | z t − ) P Z t | Y t ,Z t − ( z t | y t , z t − ) (98) = P X t ( z t − ) ( y t ) P ˜ X t ( z t − ) | X t ( z t − ) ( z t | y t ) (99) = P X t ( z t − ) , ˜ X t ( z t − ) ( y t , z t ) (100)where (98) follows from (91), (99) follows from (94) and (97), and (100) follows from simplerules about joint distributions. Thus P Z t | Z t − ( z t | z t − ) = X y t P Y t ,Z t | Z t − ( y t , z t | z t − ) (101) = X y t P X t ( z t − ) , ˜ X t ( z t − ) ( y t , z t ) (102) = P ˜ X t ( z t − ) ( z t ) (103) = P ˜ X t | ˜ X t − ( z t | z t − ) (104)where (102) holds by (100), (103) holds simply because the summation in (102) represents themarginal distribution of ˜ X t ( z t − ) , and (104) holds by (95). Thus Z n and ˜ X n have the samedistribution. In particular, since by construction ˜ X n ∈ A almost surely, also Z n ∈ A almostsurely. We now have E d H ( Y n , Z n ) = n X t =1 P ( Y t = Z t ) (105) = n X t =1 X z t − P Z t − ( z t − ) X y t = z t P Y t ,Z t | Z t − ( y t , z t | z t − ) (106) = n X t =1 X z t − P Z t − ( z t − ) X y t = z t P X t ( z t − ) , ˜ X t ( z t − ) ( y t , z t ) (107) = n X t =1 X z t − P Z t − ( z t − ) P ( X t ( z t − ) = ˜ X t ( z t − )) (108) = n X t =1 X z t − P Z t − ( z t − ) d TV (cid:0) P X t | X t − = z t − , P ˜ X t | ˜ X t − = z t − (cid:1) (109) ≤ n X t =1 X z t − P Z t − ( z t − ) r
12 log e D ( P ˜ X t | ˜ X t − = z t − k P X t | X t − = z t − ) (110) ≤ n vuut e ) n n X t =1 X z t − P Z t − ( z t − ) D ( P ˜ X t | ˜ X t − = z t − k P X t | X t − = z t − ) (111) = vuut n e n X t =1 X z t − P ˜ X t − ( z t − ) D ( P ˜ X t | ˜ X t − = z t − k P X t | X t − = z t − ) (112) = r n e D ( P ˜ X n k P X n ) (113) = s n e log 1 P X n ( A ) (114)where (107) holds by (100), (109) holds by (96), (110) holds by Pinsker’s inequality, (111) holdsby concavity of the square root, (112) holds because Z n and ˜ X n have the same distribution,(113) holds by the chain rule for relative entropy, and (114) holds because, by (93), P ˜ X n ( ˜ X n ) P X n ( ˜ X n ) = 1 P X n ( A ) a.s. (115) Remark 8:
Lemma 11 can be derived from Lemma 12 as follows. If in Lemma 12, X n is asequence of independent random variables, then by (91), Y n has the same distribution as X n .Thus P X n ( A ℓ ) = P Y n ( A ℓ ) (116) ≥ P ( d H ( Y n , Z n ) ≤ ℓ ) (117) ≥ − ℓ E d H ( Y n , Z n ) (118) ≥ − ℓ s n e log 1 P X n ( A ) (119)where (117) holds because Z n ∈ A almost surely, (118) holds by Markov’s inequality, and in(119) we have applied (92). Assuming P X n ( A ) = exp {− nγ n } where γ n → , if we choose, forexample, δ n = γ / n , we have δ n → and P X n ( A nδ n ) ≥ − γ / n √ e → . (120)This proves Lemma 11.With Lemma 12 in hand, we complete the proof of Theorem 10 with the following lemma. Lemma 13:
For any discrete stationary memoryless network N , statement 3 of Theorem 10implies statement 1. Proof:
By the same argument as in the proof of Proposition 4, statement 3 of Theorem 10is equivalent to \ δ> C V ( N , + , ( δn ) n ) = C ( N , + ) . (121)where again V is the set of transmitting nodes. By Proposition 2, the exponentially strongconverse holds if and only if, for any sequence ( ǫ n ) n where − log(1 − ǫ n ) = o ( n ) , C ( N , ( ǫ n ) n ) ⊆C ( N , + ) . Thus, to prove the lemma it is enough to show that for any ( ǫ n ) n where − log(1 − ǫ n ) = o ( n ) , and any δ > , C ( N , ( ǫ n ) n ) ⊆ C V ( N , + , ( δn ) n ) . Let R be achievable with respect to ǫ n .Thus for sufficiently large n there exists an n -length code with average probability of error at most ǫ n . Let ( φ it , ψ ij ) be the encoding/decoding functions for this code (see (9)–(10)). We describea new code, illustrated in Fig. 3, achieving the same rate vector with vanishing probability oferror on the network N ( V , δn ) . Note that for any i ∈ V c , we have X i = ∅ , so if R i > theprobability of success would be exponentially small; thus we must have R i = 0 . Network stacking:
We adopt the notion of network stacking from [35]. The motivation for ouruse of network stacking is that it allows us to convert an arbitrary coding operation at a singletime instance into a coding operation across a long block, thereby taking advantage of the law oflarge numbers. In particular, we construct N independent copies of the original n -length code,each with its own messages, using a total of nN channel uses. Each copy is referred to as a“layer”, indexed by an integer ℓ ∈ [1 : N ] . Unlike a block Markov approach [36], in which onewould transmit an n -length block corresponding to the original code in sequence, in the networkstacking approach we transmit N copies of a single time instance t ∈ [1 : n ] of the original codebefore moving on to the next one. Thus coding can be done “across the layers”, using the factthat the N copies of any symbol are i.i.d., while maintaining the causal structure of the originalcode.We use underlines to indicate symbols on the stacked network. In particular, X it ( ℓ ) is thetransmitted symbol from node i at time t in layer ℓ ; X ni ( ℓ ) refers to the n -length sequenceof symbols in layer ℓ ; X it refers to the N -length sequence of symbols at time t in all layers; X ni refers to the full nN -length sequence of all layers and time instances. We define Y it ( ℓ ) ,etc. similarly. Moreover, W i ( ℓ ) is the message originating at node i in layer ℓ , and W i is thecomplete vector of messages originating at node i across all N layers. T r a n s m i ss i on P h a s e T r a n s m i ss i on P h a s e T r a n s m i ss i on P h a s e T r a n s m i ss i on P h a s e M e ss a g e C oo r d i n a ti on P h a s e H a s h i ng P h a s e C o rr ec ti on P h a s e C o rr ec ti on P h a s e C o rr ec ti on P h a s e C o rr ec ti on P h a s e . . .. . . n Number oftimesteps:
Fig. 3. Summary of the procedure to convert a code with probability of error ǫ n to one with vanishing probability of error onthe network with an extra edge. Each timestep of the original code is copied N times into a transmission phase, followed bya subsequent correction phase that replaces some of the received signals. Prior to the n transmission and correction phases, amessage coordination phase ensures that only “good” message vectors are used; subsequently a hashing phase is used to ensureall nodes can decode. Code phases:
Given the original n -length code, we construct an N -fold stacked code asfollows, where the precise dependence between n and N is to be determined. The code consistsof n +2 phases, each consisting of a number of timesteps. These phases are visualized in Fig. 3.First we have a message coordination phase, followed by n transmission phases alternating with n correction phases, and concluded with a hashing phase. In the message coordination phase,nodes coordinate to choose a message vector in each layer with a relatively large probabilityof success; this is done in exactly the same manner as for deterministic networks in Lemma 9. Each transmission phase corresponds to one timestep t ∈ [1 : n ] in the original code: the layersact independently, each performing the coding functions from the original code at time t . In thefollowing correction phase, node a transmits data to node b , describing replacements for certainreceived data in sub-network V . Node b then disperses this data to the nodes in V ; in subsequenttransmission phases, nodes in V use this replaced data in their coding operations. In the finalhashing phase, hashes of all messages are dispersed to all nodes, which allows nodes in V c todecode. This last phase is necessary because nodes a and b do not connect directly to nodes in V c ; thus the correction approach applied to the rest of the network does not work here, sincenode a does not know what signals were received in V c . Instead, hashes are used to correct anyremaining errors in messages decoded in V c .The message coordination phase consists of O ( N ( − log(1 − ǫ n ) + log n )) timesteps. Eachtransmission phase consists of exactly N timesteps, since each layer transmits exactly once.Correction phases have variable lengths, depending on how much correction data is required,but a total of N nγ n timesteps are allocated for all correction phases, where γ n = (cid:18) − log − ǫ n n (cid:19) / . (122)The hashing phase consists of O ( √ γ n nN ) timesteps. Note that in total, the transmission phasesconsist of nN timesteps. Recalling that − log(1 − ǫ n ) = o ( n ) , γ n → as n → ∞ , so all otherphases consist of a negligible number of timesteps. Message coordination phase:
Message coordination phase: For each message vector $\mathbf{w}$ of the original code, let $P_c(\mathbf{w})$ be the probability of correctly decoding $\mathbf{w}$. Let
$$\Gamma = \left\{ \mathbf{w} : P_c(\mathbf{w}) \ge \frac{1-\epsilon_n}{2} \right\}. \qquad (123)$$
Defining $R = \sum_{i=1}^{d} R_i$, we may lower bound the cardinality of $\Gamma$ by
$$|\Gamma| = 2^{nR}\, P\!\left( P_c(W) \ge \frac{1-\epsilon_n}{2} \right) \qquad (124)$$
$$\ge 2^{nR}\, \frac{\mathbb{E}\, P_c(W) - \frac{1-\epsilon_n}{2}}{1 - \frac{1-\epsilon_n}{2}} \qquad (125)$$
$$\ge 2^{nR} \left[ (1-\epsilon_n) - \frac{1-\epsilon_n}{2} \right] \qquad (126)$$
$$= 2^{nR}\, \frac{1-\epsilon_n}{2} \qquad (127)$$
where (125) holds by Lemma 8 and the fact that $P_c(W) \le 1$, and (126) holds since the average probability of error is at most $\epsilon_n$.

In the message coordination phase, we use an identical outer code as in Lemma 9 to ensure that, with high probability, only message vectors in $\Gamma$ are ever used. By the same binning argument as in the proof of Lemma 9, this requires only $O(-\log(1-\epsilon_n) + \log n)$ bits on the link $(a,b)$ for each layer. Note that nodes $a$ and $b$ are only required to contact the nodes in $V$, since nodes in $V^c$ have no message originating at them. We may therefore assume throughout the rest of this argument that $\underline{W}(\ell) \in \Gamma$ for each $\ell \in [1:N]$.
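The bound (124)–(127) rests only on a reverse Markov inequality for $[0,1]$-valued random variables. The following Python snippet (a toy check under an arbitrary assumed distribution of per-message success probabilities) illustrates that whenever the mean of $P_c(W)$ is at least $1-\epsilon_n$, the fraction of messages landing in $\Gamma$ is at least $(1-\epsilon_n)/2$.

```python
# Toy check of the reverse-Markov bound behind (124)-(127): a [0,1]-valued
# variable with mean >= 1-eps puts mass at least (1-eps)/2 above (1-eps)/2.
import random

random.seed(0)
eps = 0.3
# Assumed per-message success probabilities with average error at most eps.
pc = [min(1.0, max(0.0, random.gauss(1 - eps + 0.05, 0.15)))
      for _ in range(10**5)]
mean = sum(pc) / len(pc)
assert mean >= 1 - eps  # average probability of error at most eps

frac_good = sum(p >= (1 - eps) / 2 for p in pc) / len(pc)
print(f"mean P_c = {mean:.3f}, fraction in Gamma = {frac_good:.3f}, "
      f"bound (1-eps)/2 = {(1 - eps) / 2:.3f}")
```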
Correction codebook: Let $P_c(\mathbf{w}, y_V^n)$ be the probability of correct decoding given message vector $\mathbf{w}$ and channel outputs $y_V^n$ at nodes $V$. That is,
$$P_c(\mathbf{w}, y_V^n) = P(\hat{W} = \mathbf{w} \mid W = \mathbf{w}, Y_V^n = y_V^n) \qquad (128)$$
where again $\hat{W}$ is the complete vector of message estimates. Since encoding and decoding functions are assumed to be deterministic (cf. (9)–(10)), channel inputs $X_V^n$ are deterministic functions of $Y_V^n$ and $W$. Thus, the only randomness in the probability in (128) is in the channel outputs $Y_{V^c}^n$ given the inputs $X_V^n$. Recalling that $X_i = \emptyset$ for $i \in V^c$, $Y_{V^c}^n$ is an independent sequence given $X_V^n$. For each message vector $\mathbf{w}$ of the original $n$-length code, let
$$Q(\mathbf{w}) = \left\{ y_V^n : P_c(\mathbf{w}, y_V^n) \ge \frac{1-\epsilon_n}{4} \right\}. \qquad (129)$$
Note that for any $\mathbf{w} \in \Gamma$,
$$\mathbb{E}\big( P_c(\mathbf{w}, Y_V^n) \mid W = \mathbf{w} \big) = P(\hat{W} = \mathbf{w} \mid W = \mathbf{w}) \qquad (130)$$
$$= P_c(\mathbf{w}) \qquad (131)$$
$$\ge \frac{1-\epsilon_n}{2}. \qquad (132)$$
Thus, applying Lemma 8 to the random variable $P_c(\mathbf{w}, Y_V^n)$ gives
$$P_{Y_V^n \mid W = \mathbf{w}}(Q(\mathbf{w})) \ge \frac{1-\epsilon_n}{4}. \qquad (133)$$
We now apply Lemma 12 to the distribution $P_{Y_V^n \mid W = \mathbf{w}}$ and the set $Q(\mathbf{w})$ to find conditional distributions $P_{Z_{V,t} \mid Y_{V,t}, Z_V^{t-1}}$ for all $t \in [1:n]$. Note that these distributions depend on the message vector $\mathbf{w}$. For each $y_{V,t} \in \mathcal{Y}_V$ and $z_V^{t-1} \in \mathcal{Y}_V^{t-1}$, independently draw
$$f_t(\mathbf{w}, y_{V,t}, z_V^{t-1}) \sim P_{Z_{V,t} \mid Y_{V,t}, Z_V^{t-1}}. \qquad (134)$$
These functions constitute a codebook known to all nodes.
Hashing codebook: For each $i \in V$ and each $\underline{w}_i \in [1:2^{nR_i}]^N$, independently and uniformly draw $g_i(\underline{w}_i)$ from $[1:2^{nN\sqrt{\gamma_n}}]$. These hashing functions also constitute a codebook known to all nodes.
Transmission phases: Before the transmission phase at time $t$, each node $i \in V$ has determined $\underline{Z}_i^{t-1}$, which represents the corrected versions of its received signals (see the description below of the correction phases). For each $\ell \in [1:N]$, node $i$ determines and transmits
$$\underline{X}_{i,t}(\ell) = \phi_{it}(\underline{W}_i(\ell), \underline{Z}_i^{t-1}(\ell)). \qquad (135)$$
For each $i \in [1:d]$, let $\underline{Y}_{i,t}(\ell)$ be the corresponding received signals.
Correction phases: In the correction phase after the transmission phase at time $t$, node $a$ learns $\underline{Y}_{i,t}$ from each $i \in V$, and determines, for each $\ell \in [1:N]$,
$$\underline{Z}_{V,t}(\ell) = f_t(\underline{W}(\ell), \underline{Y}_{V,t}(\ell), \underline{Z}_V^{t-1}(\ell)). \qquad (136)$$
For each $\ell$ for which $\underline{Z}_{V,t}(\ell) \ne \underline{Y}_{V,t}(\ell)$, node $a$ transmits to node $b$ a bit string beginning with a 1, followed by $\lceil \log N|\mathcal{Y}_V| \rceil$ bits identifying the layer $\ell \in [1:N]$ as well as the value of $\underline{Z}_{V,t}(\ell) \in \mathcal{Y}_V$. After doing this for each layer where $\underline{Z}_{V,t}(\ell) \ne \underline{Y}_{V,t}(\ell)$, node $a$ transmits the stop bit 0, signaling that all nodes should proceed to the next transmission phase. Node $b$ then forwards this data to each node $i \in V$. For all layers $\ell$ for which no correcting signal was sent, each node $i \in V$ simply sets $\underline{Z}_{it}(\ell) = \underline{Y}_{it}(\ell)$.
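The correction signalling can be phrased as a tiny prefix-style protocol. The sketch below is a minimal Python mock-up with assumed toy parameters N and Y_SIZE; the exact packing is our own choice (the paper specifies only the bit counts). Each correction costs a 1 flag plus $\lceil \log N|\mathcal{Y}_V| \rceil$ bits, terminated by the stop bit 0, matching the per-phase cost in (144).

```python
# Minimal sketch (assumed format) of the correction-phase bitstream: each
# correction is a '1' flag plus ceil(log2(N*|Y_V|)) bits encoding the layer
# index and replacement value; a final '0' stop bit ends the phase.
from math import ceil, log2

N, Y_SIZE = 8, 4                       # toy number of layers / alphabet size
FIELD = ceil(log2(N * Y_SIZE))         # bits per (layer, value) pair

def encode_phase(corrections):
    """corrections: list of (layer, value) pairs, layer < N, value < Y_SIZE."""
    bits = ""
    for layer, value in corrections:
        idx = layer * Y_SIZE + value   # pack the pair into one integer
        bits += "1" + format(idx, f"0{FIELD}b")
    return bits + "0"                  # stop bit

def decode_phase(bits):
    pos, out = 0, []
    while bits[pos] == "1":
        idx = int(bits[pos + 1:pos + 1 + FIELD], 2)
        out.append(divmod(idx, Y_SIZE))
        pos += 1 + FIELD
    return out

msg = [(2, 3), (5, 0)]
assert decode_phase(encode_phase(msg)) == msg
print(encode_phase(msg))  # cost: (FIELD+1) bits per correction, +1 stop bit
```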
Hashing phase: Node $a$ computes $g_i = g_i(\underline{W}_i)$ for all $i \in V$, and transmits these values to node $b$, which subsequently disperses them to nodes in $V$. Note that these hashes consist of a total of $d\sqrt{\gamma_n}\, nN$ bits, which is sub-linear in $nN$. Thus they can be transmitted over the link $(a,b)$ as long as $\delta > 0$. For each node $i \in V^c$, if there exists a node $j \in V$ where the point-to-point channel from $X_j$ to $Y_i$ has positive capacity, then we use a point-to-point channel code to transmit the hashes from node $j$ to node $i$. If there is no such node $j \in V$, then all received signals at node $i$ are independent of the rest of the network, so node $i$ cannot decode any messages; in particular, if $i \in \mathcal{D}_k$ for any $k \in [1:d]$, it must be that $R_k = 0$. Since the hashes occupy a sub-linear number of bits, transmitting these hashes to each node in $V^c$ takes a sub-linear number of timesteps, and can be done with arbitrarily small probability of error. One could also compute the hash for message $i$ directly at node $i$, and distribute the hash to all decoder nodes from there. We choose to compute the hashes at node $a$ merely to make the distribution of the hashes simpler to describe.
Decoding: For each $i, j \in V$ where $j \in \mathcal{D}_i$ and each $\ell \in [1:N]$, node $j$ determines
$$\hat{\underline{W}}_{ij}(\ell) = \psi_{ij}(\underline{W}_j(\ell), \underline{Z}_j^n(\ell)). \qquad (137)$$
Now consider $j \in V^c$ and each $i \in [1:d]$ where $j \in \mathcal{D}_i$. Given $\underline{Y}_j^n$ and $g_i$, node $j$ finds the unique $\hat{\underline{w}}_i$ where $g_i = g_i(\hat{\underline{w}}_i)$ and there exists $\tilde{y}_j^n$ where $\psi_{ij}(\underline{W}_j(\ell), \tilde{y}_j^n(\ell)) = \hat{\underline{w}}_i(\ell)$ for each $\ell \in [1:N]$ and
$$d_H(\underline{Y}_j^n, \tilde{y}_j^n) \le Nn\gamma_n. \qquad (138)$$
If there is no such $\hat{\underline{w}}_i$, or more than one, node $j$ declares an error.
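The decoding rule can be mocked up directly. In the Python sketch below (a toy stand-in: psi and g are hypothetical single-sequence decoder and hash functions, not the paper's codebooks), the decoder enumerates a Hamming ball of the radius in (138), decodes each candidate, and accepts only a hash-consistent message.

```python
# Toy sketch of the hash-check decoding rule: search candidates within
# Hamming radius r of y, decode each, and accept a message iff it matches
# the transmitted hash value. Error if zero or multiple matches survive.
from itertools import combinations, product

def hamming_ball(y, radius, alphabet):
    """Yield all sequences (with repeats) within Hamming distance of y."""
    n = len(y)
    for d in range(radius + 1):
        for positions in combinations(range(n), d):
            for vals in product(alphabet, repeat=d):
                cand = list(y)
                for p, v in zip(positions, vals):
                    cand[p] = v
                yield tuple(cand)

def decode(y, radius, alphabet, psi, g, hash_value):
    found = {psi(c) for c in hamming_ball(y, radius, alphabet)
             if g(psi(c)) == hash_value}
    return found.pop() if len(found) == 1 else None  # error if 0 or >1

# Toy example: psi = majority bit, g = identity hash.
psi = lambda c: int(sum(c) * 2 > len(c))
g = lambda w: w
print(decode((1, 1, 0, 1, 0), radius=1, alphabet=(0, 1), psi=psi, g=g,
             hash_value=1))
```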
Probability of error analysis: Consider the following error events:
$$\mathcal{E}_0 = \{ \text{number of timesteps used in correction phases exceeds } Nn\gamma_n \} \qquad (139)$$
and, for $i \in [1:d]$ and $j \in V^c \cap \mathcal{D}_i$,
$$\mathcal{E}_{ij}^{(1)} = \big\{ \psi_{ij}(\underline{W}_j(\ell), \tilde{y}_j^n(\ell)) \ne \underline{W}_i(\ell) \text{ for some } \ell \in [1:N], \text{ for all } \tilde{y}_j^n \text{ where } d_H(\underline{Y}_j^n, \tilde{y}_j^n) \le Nn\gamma_n \big\}, \qquad (140)$$
$$\mathcal{E}_{ij}^{(2)} = \big\{ \psi_{ij}(\underline{W}_j(\ell), \tilde{y}_j^n(\ell)) = \underline{w}_i'(\ell) \text{ for all } \ell \in [1:N], \text{ for some } \underline{w}_i' \ne \underline{W}_i \text{ where } g_i(\underline{w}_i') = g_i(\underline{W}_i) \text{ and } \tilde{y}_j^n \text{ where } d_H(\underline{Y}_j^n, \tilde{y}_j^n) \le Nn\gamma_n \big\}. \qquad (141)$$
Note that as long as $\mathcal{E}_0$ does not occur, then by Lemma 12, $\underline{Z}_V^n(\ell) \in Q(\underline{W}(\ell))$ for all $\ell$. By the definition of $Q(\mathbf{w})$, this ensures that $\hat{\underline{W}}_{ij} = \underline{W}_i$ for all $j \in V$ and $i \in [1:d]$ where $j \in \mathcal{D}_i$. Events $\mathcal{E}_{ij}^{(1)}, \mathcal{E}_{ij}^{(2)}$ cover all errors that can occur at nodes in $V^c$. Hence the probability of error of the overall code, averaged over random coding choices, is
$$P_e \le P\Big( \mathcal{E}_0 \cup \bigcup_{i \in [1:d],\, j \in V^c \cap \mathcal{D}_i} \big(\mathcal{E}_{ij}^{(1)} \cup \mathcal{E}_{ij}^{(2)}\big) \Big) \qquad (142)$$
$$\le P(\mathcal{E}_0) + \sum_{i \in [1:d],\, j \in V^c \cap \mathcal{D}_i} \big[ P(\mathcal{E}_{ij}^{(1)} \mid \mathcal{E}_0^c) + P(\mathcal{E}_{ij}^{(2)} \mid \mathcal{E}_0^c) \big]. \qquad (143)$$

We first consider $\mathcal{E}_0$. The number of bits transmitted across link $(a,b)$ during the correction phase at time $t$ is
$$d_H(\underline{Y}_{V,t}, \underline{Z}_{V,t})(\lceil \log N|\mathcal{Y}_V| \rceil + 1) + 1 \qquad (144)$$
where the final $+1$ accounts for the stop bit. Thus the number of bits transmitted during all $n$ correction phases is
$$d_H(\underline{Y}_V^n, \underline{Z}_V^n)(\lceil \log N|\mathcal{Y}_V| \rceil + 1) + n. \qquad (145)$$
Recall link $(a,b)$ has capacity $\delta > 0$, meaning it can transmit a bit roughly every $1/\delta$ timesteps (cf. (20)). Thus we can bound $\mathcal{E}_0$ by
$$P(\mathcal{E}_0) = P\left( \frac{1}{\delta}\Big[ d_H(\underline{Y}_V^n, \underline{Z}_V^n)(\lceil \log N|\mathcal{Y}_V| \rceil + 1) + n \Big] > Nn\gamma_n \right) \qquad (146)$$
$$\le \frac{\sum_{\ell=1}^{N} \mathbb{E}\, d_H(\underline{Y}_V^n(\ell), \underline{Z}_V^n(\ell))\, (\lceil \log N|\mathcal{Y}_V| \rceil + 1) + n}{\delta N n \gamma_n} \qquad (147)$$
$$\le \frac{\sum_{\ell=1}^{N} \mathbb{E}\, \sqrt{-n \log P_c(\underline{W}(\ell))}\; (\lceil \log N|\mathcal{Y}_V| \rceil + 1) + n}{\delta N n \gamma_n} \qquad (148)$$
$$\le \frac{N \sqrt{-n \log \frac{1-\epsilon_n}{2}}\; (\lceil \log N|\mathcal{Y}_V| \rceil + 1) + n}{\delta N n \gamma_n} \qquad (149)$$
$$\le \frac{1}{\delta}\, \gamma_n (\lceil \log N|\mathcal{Y}_V| \rceil + 1) + \frac{1}{\delta N \gamma_n} \qquad (150)$$
where (147) follows from Markov's inequality, (148) follows from Lemma 12, where we have dropped the constant $\frac{1}{2}\log e$ since it is less than 1, (149) from the assumption that $\underline{W}(\ell) \in \Gamma$ for all $\ell$, and (150) from the definition of $\gamma_n$ in (122). If we choose $N = \gamma_n^{-2}$, then
$$P(\mathcal{E}_0) \le \frac{1}{\delta}\, \gamma_n \left( \left\lceil \log \frac{1}{\gamma_n^2} |\mathcal{Y}_V| \right\rceil + 1 \right) + \frac{\gamma_n}{\delta} \qquad (151)$$
$$\le \frac{\gamma_n}{\delta} \big( -2\log\gamma_n + \log|\mathcal{Y}_V| + 3 \big) \qquad (152)$$
which vanishes since $-\gamma_n \log\gamma_n \to 0$ as $\gamma_n \to 0$.

Now we consider events $\mathcal{E}_{ij}^{(1)}, \mathcal{E}_{ij}^{(2)}$. Recall that if $\mathcal{E}_0$ does not occur, then $\underline{Z}_V^n(\ell) \in Q(\underline{W}(\ell))$ for all $\ell$. By the definition of $Q(\mathbf{w})$ in (129), we have, for any $y_V^n \in Q(\mathbf{w})$,
$$\frac{1-\epsilon_n}{4} \le P_c(\mathbf{w}, y_V^n) \qquad (153)$$
$$= \sum_{y_{V^c}^n} P_{Y_{V^c}^n \mid Y_V^n = y_V^n, W = \mathbf{w}}(y_{V^c}^n)\, 1\big( \psi_{ij}(y_j^n) = w_i \text{ for all } i \in [1:d],\; j \in V^c \cap \mathcal{D}_i \big). \qquad (154)$$
Note that given $Y_V^n = y_V^n$ and $W = \mathbf{w}$, $X_V^n$ is determined, since coding functions are deterministic. Since $X_i = \emptyset$ for all $i \in V^c$, this conditioning also determines $X_{[1:d]}^n$. Thus, the distribution $P_{Y_{V^c}^n \mid Y_V^n = y_V^n, W = \mathbf{w}}$ is an independent sequence. Applying the blowing-up lemma to this distribution and the set of $y_{V^c}^n$ that cause all messages to be decoded correctly in $V^c$, there exists a random sequence $Z_{V^c}^n \in \mathcal{Y}_{V^c}^n$ that causes all messages to be decoded correctly, and
$$\mathbb{E}\, d_H(Y_{V^c}^n, Z_{V^c}^n) \le \sqrt{-n \log \frac{1-\epsilon_n}{4}} \le n\gamma_n^2. \qquad (155)$$
In particular, if we produce $N$ copies of this $Z_{V^c}^n$ sequence, one for each layer, then Markov's inequality gives
$$P\big( d_H(\underline{Y}_{V^c}^n, \underline{Z}_{V^c}^n) > Nn\gamma_n \big) \le \frac{Nn\gamma_n^2}{Nn\gamma_n} = \gamma_n. \qquad (156)$$
In particular, for each $i \in [1:d]$ and $j \in V^c \cap \mathcal{D}_i$, with probability at least $1 - \gamma_n$, there exists $\tilde{y}_j^n$ that satisfies the Hamming distance condition (138), and is decoded correctly to $\underline{w}_i$. Thus $P(\mathcal{E}_{ij}^{(1)} \mid \mathcal{E}_0^c)$ vanishes.

We now consider $\mathcal{E}_{ij}^{(2)}$. The number of messages $\underline{w}_i'$ that are considered is upper bounded by the number of sequences $\tilde{y}_j^n$ satisfying (138), which is given by
$$\sum_{k=0}^{\lfloor Nn\gamma_n \rfloor} \binom{Nn}{k} |\mathcal{Y}_j|^k \le \exp\{ nN( H(\gamma_n) + \gamma_n \log|\mathcal{Y}_j| ) \} \qquad (157)$$
where $H(\cdot)$ is the binary entropy function. The probability that any given $\underline{w}_i' \ne \underline{W}_i$ agrees with the hash value $g_i$ is $2^{-nN\sqrt{\gamma_n}}$, so
$$P(\mathcal{E}_{ij}^{(2)} \mid \mathcal{E}_0^c) \le \exp\{ nN( H(\gamma_n) + \gamma_n \log|\mathcal{Y}_j| ) - nN\sqrt{\gamma_n} \} \qquad (158)$$
$$\le \exp\{ -nN\sqrt{\gamma_n}/2 \} \qquad (159)$$
$$= \exp\{ -n\gamma_n^{-3/2}/2 \} \qquad (160)$$
where (159) holds for sufficiently large $n$, since $\gamma_n \to 0$ and $\lim_{p \to 0} H(p)/\sqrt{p} = 0$, and (160) holds again by the choice $N = \gamma_n^{-2}$. Since $n\gamma_n^{-3/2} \to \infty$ as $n \to \infty$, $P(\mathcal{E}_{ij}^{(2)} \mid \mathcal{E}_0^c)$ vanishes.
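To get a feel for the rates of convergence, the short Python script below (illustrative numbers only, under the sample choice $1-\epsilon_n = 2^{-\sqrt{n}}$, which satisfies $-\log(1-\epsilon_n) = o(n)$) evaluates the bounds (152) and (160): the $\mathcal{E}_0$ bound decays like $\gamma_n \log(1/\gamma_n)$ and is therefore quite slow, while the hash-collision bound decays super-exponentially.

```python
# Illustrative evaluation of the error bounds (152) and (160) for the
# sample sequence 1 - eps_n = 2^(-sqrt(n)); all parameter values are toys.
from math import log2, sqrt

delta, Y_V = 0.5, 4  # assumed extra-edge capacity and |Y_V|

for n in [10**4, 10**6, 10**8, 10**10]:
    neg_log = sqrt(n) + 2              # -log((1 - eps_n)/4)
    gamma = (neg_log / n) ** 0.25      # gamma_n from (122)
    e0 = (gamma / delta) * (-2 * log2(gamma) + log2(Y_V) + 3)   # (152)
    hash_exp = -n * gamma ** (-1.5) / 2                         # (160), in bits
    print(f"n={n:.0e}: gamma_n={gamma:.4f}, P(E0) bound={e0:.3f}, "
          f"log2 P(E2) bound={hash_exp:.2e}")
```

The slow decay of the first bound is consistent with the remark below: the blowing-up argument is comfortable at this "middle" level but does not stretch to the very weak edge removal regime.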
Remark 9: The blowing-up lemma does not appear to be strong enough to prove that the very weak edge removal property implies the ordinary strong converse. Were we to apply the same argument above to the case $\epsilon_n = \epsilon \in (0,1)$, in the key application of the blowing-up lemma in (148), we would have
$$\mathbb{E}\, d_H(\underline{Y}_V^n, \underline{Z}_V^n) \le \sqrt{-n \log \frac{1-\epsilon}{2}}. \qquad (161)$$
This suggests that at least $O(\sqrt{n})$ bits per layer would be required on the extra link. However, very weak edge removal requires that we achieve the same capacity region using any $k_n$ sequence of bits converging to infinity, which includes sequences growing more slowly than $\sqrt{n}$.

VI. NETWORKS OF INDEPENDENT POINT-TO-POINT LINKS
We now consider the setting of network equivalence [35], in which $\mathcal{N}$ consists of a stationary memoryless network made up of independent point-to-point (noisy) links. Let $\bar{\mathcal{N}}$ be the same network in which each noisy point-to-point link is replaced by a noiseless bit-pipe of the same capacity. The basic result of network equivalence states that $C(\mathcal{N}, 0^+) = C(\bar{\mathcal{N}}, 0^+)$. Theorem 10 already asserts that for such networks, the weak edge removal property holds if and only if the exponentially strong converse holds. The following theorem proves that, for such networks with acyclic topology, the same holds for the "lower level" in Fig. 1; i.e., the very weak edge removal property and the ordinary strong converse. The proof, given in Appendix E, makes use of the network equivalence principle to connect codes on $\mathcal{N}$ to codes on $\bar{\mathcal{N}}$, and then applies Theorem 7 on $\bar{\mathcal{N}}$.
For a discrete stationary memoryless network $\mathcal{N}$ consisting of independent point-to-point links with acyclic topology, the very weak edge removal property holds if and only if the strong converse holds.

VII. APPLICATIONS
A. Outer Bounds
Consider any outer bound $R_{\mathrm{out}}(\mathcal{N})$ for the memoryless stationary network $\mathcal{N}$; i.e., where $C(\mathcal{N}, 0^+) \subseteq R_{\mathrm{out}}(\mathcal{N})$. Suppose we could show
$$\bigcup_{k_n = o(n)} C_V(\mathcal{N}, 0^+, (k_n)_n) \subseteq R_{\mathrm{out}}(\mathcal{N}) \qquad (162)$$
where as usual $V$ is the set of nodes $i$ where $X_i \ne \emptyset$. In other words, the outer bound is continuous with respect to the capacity of the extra edge; that is, the outer bound satisfies a weak edge removal property. Then, applying Lemma 13, we immediately find
$$\bigcup_{\epsilon_n : -\log(1-\epsilon_n) = o(n)} C(\mathcal{N}, (\epsilon_n)_n) \subseteq R_{\mathrm{out}}(\mathcal{N}). \qquad (163)$$
This means that the outer bound holds in an exponentially strong sense; that is, for any rate vector outside $R_{\mathrm{out}}(\mathcal{N})$, the probability of error approaches 1 exponentially fast.

An outer bound may also satisfy a strong edge removal property, meaning that for some constant $K$ and any $\delta$,
$$C(\mathcal{N}, 0^+, (\delta n)_n) \subseteq R_{\mathrm{out}}(\mathcal{N}) + [0, K\delta]^d. \qquad (164)$$
We have no equivalence between the strong edge removal property and the extremely strong converse for general noisy networks, but we do for deterministic networks. Thus, applying Lemma 9, if a deterministic network satisfies (164), then the outer bound holds in an extremely strong sense; that is, for any rate vector outside $R_{\mathrm{out}}(\mathcal{N})$, the probability of error approaches 1 at an exponential rate linear in the distance to the outer bound.

For many outer bounds (indeed, almost every computable outer bound that we know of), (162) can be proved without much difficulty, and in some cases the stronger statement (164) can be proved as well. This implies that most outer bounds for discrete memoryless networks hold in an exponentially strong sense, and many outer bounds for deterministic networks hold in an extremely strong sense. We illustrate this for several outer bounds (or weak converse arguments) in the next few subsections.

B. Cut-set Bound
Recall that the cut-set outer bound [37] is given by $C(\mathcal{N}, 0^+) \subseteq R_{\text{cut-set}}(\mathcal{N})$ where
$$R_{\text{cut-set}}(\mathcal{N}) = \bigcup_{P_{X_1,\ldots,X_d}} \left\{ \mathbf{R} : \sum_{i \in S :\, \mathcal{D}_i \cap S^c \ne \emptyset} R_i \le I(X_S; Y_{S^c} \mid X_{S^c}) \text{ for all } S \subseteq [1:d] \right\}. \qquad (165)$$
In the following, we prove (164) for this bound. This allows us to reproduce the result of [21], that the cut-set bound holds in an exponentially strong sense: that is, for any rate vector outside $R_{\text{cut-set}}(\mathcal{N})$, the probability of error goes to 1 exponentially fast. This further implies that any network with a tight cut-set bound (i.e., where $C(\mathcal{N}, 0^+) = R_{\text{cut-set}}(\mathcal{N})$) satisfies the exponentially strong converse. Furthermore, we conclude that for deterministic networks, the cut-set bound holds in an extremely strong sense.

Fix some sequence $(k_n)_n$, and let $\mathbf{R} \in C(\mathcal{N}, 0^+, (k_n)_n)$. Consider a code achieving this rate vector, and let $Z_t$ be the symbol sent along edge $(a,b)$ at time $t$, or $\emptyset$ if there is no symbol at time $t$. Note $H(Z^n) \le k_n$. Fix any cut set $S \subseteq [1:d]$, and let $S^c = [1:d] \setminus S$. Also let $T$ be the set of message flows that cross the cut; that is, the set of $i \in S$ where $\mathcal{D}_i \cap S^c \ne \emptyset$. We may write
$$n \sum_{i \in T} R_i = H(M_T) \qquad (166)$$
$$\le I(M_T; Y_{S^c}^n, Z^n) + n\epsilon_n \qquad (167)$$
$$= \sum_{t=1}^{n} I(M_T; Y_{S^c,t}, Z_t \mid Y_{S^c}^{t-1}, Z^{t-1}) + n\epsilon_n \qquad (168)$$
$$= \sum_{t=1}^{n} I(M_T; Y_{S^c,t}, Z_t \mid Y_{S^c}^{t-1}, Z^{t-1}, X_{S^c,t}) + n\epsilon_n \qquad (169)$$
$$\le \sum_{t=1}^{n} I(M_T, Y_{S^c}^{t-1}, X_{S,t}; Y_{S^c,t}, Z_t \mid Z^{t-1}, X_{S^c,t}) + n\epsilon_n \qquad (170)$$
$$\le \sum_{t=1}^{n} \big[ I(M_T, Y_{S^c}^{t-1}, X_{S,t}; Y_{S^c,t} \mid Z^{t-1}, X_{S^c,t}) + H(Z_t \mid Z^{t-1}) \big] + n\epsilon_n \qquad (171)$$
$$\le \sum_{t=1}^{n} I(X_{S,t}; Y_{S^c,t} \mid X_{S^c,t}) + H(Z^n) + n\epsilon_n \qquad (172)$$
$$\le n I(X_S; Y_{S^c} \mid X_{S^c}, Q) + k_n + n\epsilon_n \qquad (173)$$
$$\le n I(X_S; Y_{S^c} \mid X_{S^c}) + k_n + n\epsilon_n \qquad (174)$$
where (167) follows from Fano's inequality, where $\epsilon_n \to 0$ as $n \to \infty$; (169) follows since $X_{S^c,t}$ is a function of $Y_{S^c}^{t-1}$ and $Z^{t-1}$; (172) follows from the memorylessness and causality of the network model; and (173) follows by defining $Q \sim \text{Unif}[1:n]$, $X_i = X_{i,Q}$, and $Y_i = Y_{i,Q}$, and by the fact that $H(Z^n) \le k_n$. Recalling that $\epsilon_n \to 0$, we have
$$C_V(\mathcal{N}, 0^+, (k_n)_n) \subseteq R_{\text{cut-set}}(\mathcal{N}) + \left[ 0, \lim_{n \to \infty} \frac{k_n}{n} \right]^d. \qquad (175)$$
In particular, (164) holds with $K = 1$. This in turn implies (162). Therefore, for discrete memoryless stationary networks, the cut-set bound holds in an exponentially strong sense, and for deterministic networks, the cut-set bound holds in an extremely strong sense.

These facts allow us to immediately derive strong converse results for various problems for which the cut-set bound is tight. For example:
1) since the cut-set bound is tight for relay channels that are degraded, reversely degraded [36], or semideterministic [38], the exponentially strong converse holds;
2) since the cut-set bound is tight for linear finite-field deterministic multicast networks [39], the extremely strong converse holds.
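To make the cut-set expression concrete, here is a small self-contained Python helper (our own illustration, not from the paper) evaluating the mutual information in (165) for the simplest cut: a single BSC link from node 1 to node 2, for which the bound reduces to the familiar $R_1 \le 1 - H(p)$ under a uniform input.

```python
# Toy helper for the cut-set term I(X_S; Y_{S^c} | X_{S^c}) in the simplest
# case: one source, one sink, a BSC(p) across the cut (conditioning empty).
from math import log2

def H(p):
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def mutual_information(joint):
    """joint: dict mapping (x, y) -> probability."""
    px, py = {}, {}
    for (x, y), pr in joint.items():
        px[x] = px.get(x, 0) + pr
        py[y] = py.get(y, 0) + pr
    return sum(pr * log2(pr / (px[x] * py[y]))
               for (x, y), pr in joint.items() if pr > 0)

p = 0.11  # BSC crossover probability
joint = {(x, y): 0.5 * (1 - p if x == y else p) for x in (0, 1) for y in (0, 1)}
print(mutual_information(joint), 1 - H(p))  # both equal 1 - H(0.11)
```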
C. Broadcast Channel

A broadcast channel is a network where $Y_1 = \emptyset$, $X_i = \emptyset$ for all $i > 1$, and we allow multiple messages to originate at node 1, each to be decoded at a subset of nodes in $[2:d]$. Note that this model includes scenarios where there are private messages, public messages, and/or messages intended for some decoders but not all. We claim that the weak edge removal property and the exponentially strong converse hold for discrete memoryless broadcast channels. Indeed, the $V$ set in Theorem 10 is simply $\{1\}$. Thus, for any sequence $(k_n)_n$ (whether or not it is $o(n)$), $C_{\{1\}}(\mathcal{N}, 0^+, (k_n)_n) = C(\mathcal{N}, 0^+)$, simply because if the extra nodes $a$ and $b$ can only communicate with node 1, then any processing done at nodes $a$ and $b$ can simply be reproduced internally at node 1. Theorem 10 immediately proves the claim.

For degraded broadcast channels, the strong converse was proved in [32], and the exponentially strong converse in [40]. However, since the capacity of the broadcast channel in general is unknown, strong converses for general broadcast channels have received little attention. As far as we know, this is the first strong (or exponentially strong) converse that has been proved for a problem for which the capacity region has no known single-letter characterization. In [41], a strong converse was established for a common randomness generation problem for which a single-letter characterization was established in [42]; this strong converse generalizes to non-discrete alphabets, including sources where the single-letter characterization has no known computable characterization, because of an auxiliary random variable. Both the result of [41] and our result on the broadcast channel are examples of strong converses for problems with no known computable rate region. The simplicity of the above proof on the broadcast channel, once we have Theorem 10, is particularly noteworthy.

Fig. 4. The 2-user interference channel.
D. Discrete 2-User Interference Channel with Strong Interference
A 2-user interference channel, illustrated in Fig. 4, is a network with 4 nodes, where $Y_1 = Y_2 = X_3 = X_4 = \emptyset$, $\mathcal{D}_1 = \{3\}$, and $\mathcal{D}_2 = \{4\}$. Note that, to be consistent with the notation in the rest of the paper, the received symbol at the node decoding the first message is $Y_3$, rather than $Y_1$, as it is typically denoted.

Recall that an interference channel has strong interference [43] if
$$I(X_1; Y_3 \mid X_2) \le I(X_1; Y_4 \mid X_2), \qquad I(X_2; Y_4 \mid X_1) \le I(X_2; Y_3 \mid X_1) \qquad (176)$$
for all $P_{X_1}(x_1) P_{X_2}(x_2)$. The capacity region of the interference channel in this regime was found in [44] to be the set of rate pairs $(R_1, R_2)$ such that
$$R_1 \le I(X_1; Y_3 \mid X_2, Q), \qquad (177)$$
$$R_2 \le I(X_2; Y_4 \mid X_1, Q), \qquad (178)$$
$$R_1 + R_2 \le \min\{ I(X_1, X_2; Y_3 \mid Q),\; I(X_1, X_2; Y_4 \mid Q) \} \qquad (179)$$
for some $P_Q(q) P_{X_1|Q}(x_1|q) P_{X_2|Q}(x_2|q)$ with $|\mathcal{Q}| \le 4$.

The following proposition establishes the exponentially strong converse under strong interference. The strong converse for the interference channel with very strong interference (in addition to fixed-error second-order results) was derived in [45]. The strong converse for the Gaussian interference channel with strong interference was proved in [46].
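Before the proof, note that (176) is easy to probe numerically. The Python sketch below uses a toy channel of our own construction ($Y_3$ a noisier observation of $X_1$ than $Y_4$, with $X_2$ ignored), for which the first inequality in (176) holds for every product input by degradation and both sides of the second inequality are zero; the script spot-checks the first inequality over random input distributions.

```python
# Toy numeric probe of the strong-interference condition (176): here
# Y3 = BSC_p(X1) and Y4 = BSC_q(X1) with q < p, and X2 is ignored, so
# I(X1;Y3|X2) <= I(X1;Y4|X2) by degradation and I(X2;Y4|X1)=I(X2;Y3|X1)=0.
import random
from math import log2

def H(p):
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_mi(a, p):
    """I(X;Y) for X ~ Bernoulli(a) through a BSC(p)."""
    out = a * (1 - p) + (1 - a) * p  # P(Y = 1)
    return H(out) - H(p)

random.seed(1)
p, q = 0.2, 0.05
for _ in range(5):
    a = random.random()  # P(X1 = 1); P_X2 is irrelevant in this toy channel
    assert bsc_mi(a, p) <= bsc_mi(a, q) + 1e-12
    print(f"P(X1=1)={a:.2f}: I(X1;Y3|X2)={bsc_mi(a, p):.4f} "
          f"<= I(X1;Y4|X2)={bsc_mi(a, q):.4f}")
```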
Proposition 15:
For an interference channel with strong interference, the weak edge removal property and the exponentially strong converse hold.
Proof:
Note that the only nodes $i$ in an interference channel where $X_i \ne \emptyset$ are the encoder nodes, i.e., nodes 1 and 2. Thus, by Theorem 10, to prove the proposition it is enough to show that for any $k_n = o(n)$, $C_{\{1,2\}}(\mathcal{N}, 0^+, (k_n)_n) \subseteq C(\mathcal{N}, 0^+)$, where $C(\mathcal{N}, 0^+)$ is the region defined in (177)–(179).

We claim that an interference channel with strong interference also satisfies (176) for any joint distribution $P_{X_1,X_2}$, even when $X_1, X_2$ are not independent. Consider any joint distribution $P_{X_1,X_2}$. For fixed $x_2$, define $\tilde{X}_1, \tilde{X}_2$ where $\tilde{X}_1 \sim P_{X_1|X_2=x_2}$ and $\tilde{X}_2 = x_2$ deterministically. Since $\tilde{X}_2$ is deterministic, $\tilde{X}_1$ and $\tilde{X}_2$ are trivially independent, so by (176) we have
$$I(\tilde{X}_1; \tilde{Y}_3 \mid \tilde{X}_2) \le I(\tilde{X}_1; \tilde{Y}_4 \mid \tilde{X}_2) \qquad (180)$$
where $\tilde{Y}_3, \tilde{Y}_4$ represent the outputs of the channel with $\tilde{X}_1, \tilde{X}_2$ as inputs. Note that $P_{\tilde{X}_1, \tilde{Y}_3, \tilde{Y}_4} = P_{X_1, Y_3, Y_4 \mid X_2 = x_2}$. Thus $I(\tilde{X}_1; \tilde{Y}_3 \mid \tilde{X}_2) = I(X_1; Y_3 \mid X_2 = x_2)$ and $I(\tilde{X}_1; \tilde{Y}_4 \mid \tilde{X}_2) = I(X_1; Y_4 \mid X_2 = x_2)$, so by (180)
$$I(X_1; Y_3 \mid X_2 = x_2) \le I(X_1; Y_4 \mid X_2 = x_2). \qquad (181)$$
Since (181) holds for any $x_2$, we have
$$I(X_1; Y_3 \mid X_2) = \sum_{x_2} P_{X_2}(x_2)\, I(X_1; Y_3 \mid X_2 = x_2) \qquad (182)$$
$$\le \sum_{x_2} P_{X_2}(x_2)\, I(X_1; Y_4 \mid X_2 = x_2) \qquad (183)$$
$$= I(X_1; Y_4 \mid X_2). \qquad (184)$$
Similar reasoning establishes the second inequality in (176) for any $P_{X_1,X_2}$. This proves the claim.

Now, by the same proof as the lemma in [44] for the independent case, for any $P_{X_1^n, X_2^n}$,
$$I(X_1^n; Y_3^n \mid X_2^n) \le I(X_1^n; Y_4^n \mid X_2^n), \qquad I(X_2^n; Y_4^n \mid X_1^n) \le I(X_2^n; Y_3^n \mid X_1^n) \qquad (185)$$
where
$$P_{Y_3^n, Y_4^n \mid X_1^n, X_2^n}(y_3^n, y_4^n \mid x_1^n, x_2^n) = \prod_{t=1}^{n} P_{Y_3, Y_4 \mid X_1, X_2}(y_{3,t}, y_{4,t} \mid x_{1,t}, x_{2,t}). \qquad (186)$$

Consider $(R_1, R_2) \in C_{\{1,2\}}(\mathcal{N}, 0^+, (k_n)_n)$ where $k_n = o(n)$. Thus, there exists a sequence of codes with rates $(R_1, R_2)$, with vanishing probability of error, on the modified network with an extra edge carrying $k_n$ bits as a function of the blocklength $n$. Given a code of blocklength $n$, let $Z_t$ be the signal sent on the edge $(a,b)$ at time $t \in [1:n]$. Note that, since $k_n = o(n)$, for most values of $t \in [1:n]$, no bit is transmitted across $(a,b)$ at time $t$ (cf. the transmission schedule in (20)); for these $t$ we simply take $Z_t$ to be null. Certainly $H(Z^n) \le k_n$. Since for $j = 1, 2$, $X_j^n$ is a function of message $W_j$ and $Z^n$, we have
$$I(X_1^n; X_2^n \mid Z^n) \le I(W_1; W_2 \mid Z^n) \qquad (187)$$
$$\le I(W_1; W_2, Z^n) \qquad (188)$$
$$= I(W_1; W_2) + I(W_1; Z^n \mid W_2) \qquad (189)$$
$$\le H(Z^n) \qquad (190)$$
$$\le k_n \qquad (191)$$
where (190) follows since the messages are assumed to be independent. Since node $a$ only has access to $W_1, W_2$, we have the Markov chain
$$(W_1, W_2, Z^n) \to (X_1^n, X_2^n) \to (Y_3^n, Y_4^n). \qquad (192)$$
We now write
$$nR_1 = H(W_1 \mid W_2) \qquad (193)$$
$$= I(W_1; Y_3^n, Z^n \mid W_2) + H(W_1 \mid Y_3^n, W_2, Z^n) \qquad (194)$$
$$\le I(W_1; Y_3^n \mid W_2, Z^n) + k_n + n\epsilon_n \qquad (195)$$
$$\le I(W_1, W_2, X_1^n; Y_3^n \mid X_2^n, Z^n) + k_n + n\epsilon_n \qquad (196)$$
$$\le I(X_1^n; Y_3^n \mid X_2^n, Z^n) + k_n + n\epsilon_n \qquad (197)$$
where in (195) we have used the fact that $H(Z^n) \le k_n$, and Fano's inequality, where $\epsilon_n \to 0$ as $n \to \infty$, and (197) holds by the Markov chain in (192). Similarly,
$$nR_2 \le I(X_2^n; Y_4^n \mid X_1^n, Z^n) + k_n + n\epsilon_n. \qquad (198)$$
We also have
$$nR_1 = H(W_1) \qquad (199)$$
$$\le I(W_1; Y_3^n, Z^n) + n\epsilon_n \qquad (200)$$
$$\le I(W_1; Y_3^n \mid Z^n) + k_n + n\epsilon_n \qquad (201)$$
$$\le I(W_1, X_1^n; Y_3^n \mid Z^n) + k_n + n\epsilon_n \qquad (202)$$
$$= I(X_1^n; Y_3^n \mid Z^n) + I(W_1; Y_3^n \mid X_1^n, Z^n) + k_n + n\epsilon_n \qquad (203)$$
$$\le I(X_1^n; Y_3^n \mid Z^n) + I(W_1; Y_3^n, X_2^n \mid X_1^n, Z^n) + k_n + n\epsilon_n \qquad (204)$$
$$= I(X_1^n; Y_3^n \mid Z^n) + I(W_1; X_2^n \mid X_1^n, Z^n) + k_n + n\epsilon_n \qquad (205)$$
$$\le I(X_1^n; Y_3^n \mid Z^n) + I(W_1; W_2 \mid Z^n) + k_n + n\epsilon_n \qquad (206)$$
$$\le I(X_1^n; Y_3^n \mid Z^n) + 2k_n + n\epsilon_n \qquad (207)$$
where in (205) we have again used the Markov chain in (192). Combining (198) with (207) gives
$$n(R_1 + R_2) \le I(X_1^n; Y_3^n \mid Z^n) + I(X_2^n; Y_4^n \mid Z^n, X_1^n) + 3k_n + 2n\epsilon_n \qquad (208)$$
$$\le I(X_1^n; Y_3^n \mid Z^n) + I(X_2^n; Y_3^n \mid Z^n, X_1^n) + 3k_n + 2n\epsilon_n \qquad (209)$$
$$= I(X_1^n, X_2^n; Y_3^n \mid Z^n) + 3k_n + 2n\epsilon_n \qquad (210)$$
where (209) follows from (185). We may also repeat this argument to find (210) with $Y_3$ replaced by $Y_4$. To summarize,
$$nR_1 \le I(X_1^n; Y_3^n \mid X_2^n, Z^n) + k_n + n\epsilon_n, \qquad (211)$$
$$nR_2 \le I(X_2^n; Y_4^n \mid X_1^n, Z^n) + k_n + n\epsilon_n, \qquad (212)$$
$$n(R_1 + R_2) \le \min\{ I(X_1^n, X_2^n; Y_3^n \mid Z^n),\; I(X_1^n, X_2^n; Y_4^n \mid Z^n) \} + 3k_n + 2n\epsilon_n, \qquad (213)$$
$$k_n \ge I(X_1^n; X_2^n \mid Z^n). \qquad (214)$$
One can see that this is precisely the (multi-letter) region for the interference channel when both messages are required to be decoded at both decoders, except that we have close-to-independence instead of exact independence. The difficulty with condition (214) is not just that $X_1^n, X_2^n$ are not perfectly independent, but that the dependence between individual letters $X_{1,t}, X_{2,t}$ may vary depending on $t$. The method of Dueck in [47] (also similar to Ahlswede's "wringing" technique [48]) allows us to show that for most $t \in [1:n]$, the letters $X_{1,t}, X_{2,t}$ are nearly independent. This will allow single-letterization of the region in (211)–(214). In particular, there exist some $m \le \sqrt{nk_n}$ and $t_1, \ldots, t_m \in [1:n]$, where for all $t \in [1:n]$
$$I(X_{1,t}; X_{2,t} \mid Q') \le \sqrt{\frac{k_n}{n}} \qquad (215)$$
where
$$Q' = (Z^n, X_{1,t_1}, \ldots, X_{1,t_m}, X_{2,t_1}, \ldots, X_{2,t_m}). \qquad (216)$$
We reproduce the essential proof of this fact from [47] as follows (a toy sketch of the iteration appears after this proof). First, let
$$T_1 = \left\{ t \in [1:n] : I(X_{1,t}; X_{2,t} \mid Z^n) > \sqrt{\frac{k_n}{n}} \right\}. \qquad (217)$$
If $T_1$ is empty, then we may take $m = 0$ and we are done. Otherwise, let $t_1$ be any element of $T_1$. We may write
$$I(X_1^n; X_2^n \mid Z^n, X_{1,t_1}, X_{2,t_1}) = I(X_1^n; X_2^n \mid Z^n) - I(X_1^n; X_{2,t_1} \mid Z^n) - I(X_{1,t_1}; X_2^n \mid Z^n, X_{2,t_1}) \qquad (218)$$
$$\le I(X_1^n; X_2^n \mid Z^n) - I(X_{1,t_1}; X_{2,t_1} \mid Z^n) \qquad (219)$$
$$\le k_n - \sqrt{\frac{k_n}{n}} \qquad (220)$$
where (220) follows from (214) and the fact that $t_1 \in T_1$ as defined in (217). Next, let
$$T_2 = \left\{ t \in [1:n] : I(X_{1,t}; X_{2,t} \mid Z^n, X_{1,t_1}, X_{2,t_1}) > \sqrt{\frac{k_n}{n}} \right\}. \qquad (221)$$
If $T_2$ is empty, then we may take $m = 1$ and again we are done. Otherwise, take $t_2$ to be any element of $T_2$, and proceed as above. This process must terminate after a finite number (say $m$) of steps, at which point (215) must hold for all $t$. By a similar argument as in (218)–(220), for each $i \in [1:m]$
$$I(X_1^n; X_2^n \mid Z^n, X_{1,t_1}, \ldots, X_{1,t_i}, X_{2,t_1}, \ldots, X_{2,t_i}) \le k_n - i\sqrt{\frac{k_n}{n}} \qquad (222)$$
and in particular
$$I(X_1^n; X_2^n \mid Q') \le k_n - m\sqrt{\frac{k_n}{n}}. \qquad (223)$$
Since the mutual information is nonnegative, we have $m \le \sqrt{nk_n}$.

We now have
$$I(X_1^n; Y_3^n \mid X_2^n, Z^n) \le I(X_1^n; Y_3^n \mid X_2^n, Q') + H(X_{1,t_1}, \ldots, X_{1,t_m}, X_{2,t_1}, \ldots, X_{2,t_m}) \qquad (224)$$
$$\le I(X_1^n; Y_3^n \mid X_2^n, Q') + m \log |\mathcal{X}_1| \cdot |\mathcal{X}_2| \qquad (225)$$
$$\le I(X_1^n; Y_3^n \mid X_2^n, Q') + \sqrt{nk_n} \log |\mathcal{X}_1| \cdot |\mathcal{X}_2| \qquad (226)$$
$$= \sum_{t=1}^{n} I(X_1^n; Y_{3,t} \mid Y_3^{t-1}, X_2^n, Q') + \sqrt{nk_n} \log |\mathcal{X}_1| \cdot |\mathcal{X}_2| \qquad (227)$$
$$\le \sum_{t=1}^{n} I(X_{1,t}; Y_{3,t} \mid X_{2,t}, Q') + \sqrt{nk_n} \log |\mathcal{X}_1| \cdot |\mathcal{X}_2| \qquad (228)$$
$$= n I(X_1; Y_3 \mid X_2, Q) + \sqrt{nk_n} \log |\mathcal{X}_1| \cdot |\mathcal{X}_2| \qquad (229)$$
where $Q'' \sim \text{Unif}[1:n]$, and
$$Q = (Q', Q''), \quad X_1 = X_{1,Q''}, \quad X_2 = X_{2,Q''}, \quad Y_3 = Y_{3,Q''}, \quad Y_4 = Y_{4,Q''}. \qquad (230)$$
Applying (211), and performing similar analyses for (212)–(213), combined with (215), we have
$$R_1 \le I(X_1; Y_3 \mid X_2, Q) + \frac{k_n}{n} + \epsilon_n + \sqrt{\frac{k_n}{n}} \log |\mathcal{X}_1| \cdot |\mathcal{X}_2|, \qquad (231)$$
$$R_2 \le I(X_2; Y_4 \mid X_1, Q) + \frac{k_n}{n} + \epsilon_n + \sqrt{\frac{k_n}{n}} \log |\mathcal{X}_1| \cdot |\mathcal{X}_2|, \qquad (232)$$
$$R_1 + R_2 \le \min\{ I(X_1, X_2; Y_3 \mid Q),\; I(X_1, X_2; Y_4 \mid Q) \} + \frac{3k_n}{n} + 2\epsilon_n + \sqrt{\frac{k_n}{n}} \log |\mathcal{X}_1| \cdot |\mathcal{X}_2|, \qquad (233)$$
$$\sqrt{\frac{k_n}{n}} \ge I(X_1; X_2 \mid Q). \qquad (234)$$
Using standard tools to bound the cardinality of auxiliary random variables (e.g., [29, Appendix C]), for each $n$, there exists a joint distribution $P^{(n)}_{QX_1X_2}$ with $|\mathcal{Q}|$ bounded by a constant that preserves the value of each mutual information quantity in (231)–(234). Recall that we started with a different code for each blocklength $n$, so the above procedure results in a different joint distribution $P^{(n)}_{QX_1X_2}$ for each $n$. This constitutes a sequence of joint distributions on a compact set, so there exists a convergent subsequence, with limit $P_{QX_1X_2}$. Since $k_n = o(n)$, $\epsilon_n \to 0$, and mutual information is continuous for fixed alphabets, this limiting distribution must satisfy (177)–(179); moreover, since in the limit (234) implies that $I(X_1; X_2 \mid Q) = 0$, we may factor the joint distribution as $P_Q P_{X_1|Q} P_{X_2|Q}$. Finally, we may further reduce the cardinality of the auxiliary random variable in (177)–(179) to $|\mathcal{Q}| \le 4$.
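As promised above, the following Python sketch mimics the wringing iteration (217)–(223): condition on the pair at an offending index, and stop once every index falls below the threshold $\sqrt{k_n/n}$. Here `toy_cmi` is an artificial stand-in for the conditional mutual informations $I(X_{1,t}; X_{2,t} \mid Z^n, \cdot)$, not a computation on an actual code.

```python
# Toy sketch of Dueck's wringing iteration: repeatedly find an index t whose
# conditional dependence exceeds the threshold, condition on (X_{1,t}, X_{2,t}),
# and stop when none remains. The budget (222)-(223) caps rounds at sqrt(n*k_n).

def wringing(n, k_n, cmi, threshold):
    conditioned = []                      # chosen indices t_1, ..., t_m
    while True:
        bad = [t for t in range(n)
               if cmi(t, tuple(conditioned)) > threshold]
        if not bad:
            return conditioned            # (215) now holds for all t
        conditioned.append(bad[0])

# Artificial model: each round of conditioning halves every residual dependence.
def toy_cmi(t, conditioned, base=0.3):
    return base / (2 ** len(conditioned))

n, k_n = 100, 4
threshold = (k_n / n) ** 0.5              # sqrt(k_n / n), as in (215)
rounds = wringing(n, k_n, toy_cmi, threshold)
print(f"threshold={threshold:.3f}, wringing rounds m={len(rounds)}")
```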
VIII. CONCLUSIONS

This paper explored the relationship between edge removal properties and strong converses. Our main results are summarized in Fig. 1. We found three main levels of properties for both edge removal and strong converse, and showed that for a very large class of networks, the strong converse property implies the corresponding edge removal property. Implications in the opposite direction hold for deterministic networks and sometimes for memoryless stationary networks. Our strongest results are those for the "middle" level in Fig. 1, connecting the weak edge removal property to the exponentially strong converse. In particular, we showed that these properties are equivalent for all discrete memoryless stationary networks. Thus, if an existing weak converse or outer bound can be strengthened to show that it still holds in the presence of an extra link carrying a sub-linear number of bits, then the converse or outer bound also holds in an exponentially strong sense, meaning that for any rate vector outside the region, the probability of error converges to 1 exponentially fast. It appears that many existing arguments can be strengthened in this sense with relatively little effort, thereby proving exponentially strong results. We believe that this middle level deserves more focus than it has received so far, because exponentially strong converses and weak edge removal properties seem to hold for so many problems (at least under average probability of error). Therefore, one should always ask whether a given converse proof can be strengthened in this sense.

Several open problems remain:
1) The most important question is whether edge removal and strong converse properties hold in general. In particular, we know of no memoryless stationary network for which the weak edge removal property or the exponentially strong converse does not hold under average probability of error. The techniques of Sec. VII seem to allow one to prove a weak edge removal property (and thus an exponentially strong converse) for most (perhaps all) existing single-letter outer bounds, but there is no apparent way to do this without an existing single-letter result. Our observation that the properties hold for the discrete broadcast channel suggests that it may be possible to prove such results even for problems without known single-letter characterizations of the capacity region, but we know of no other cases for which this has been done.
2) Many of our results (particularly those showing that edge removal implies a strong converse) apply only for discrete channel coding problems; generalizing these results to continuous systems, channel cost constraints, source coding contexts, and random channel state would allow applicability to many other important network information theory problems.
3) We conjecture that an equivalence holds for discrete memoryless networks on the "lower layer" in Fig. 1, between very weak edge removal and the ordinary strong converse, but we have only been able to prove this result for deterministic networks and acyclic networks of independent point-to-point links.
4) Finally, it would be interesting to find a strong converse property equivalent to the extremely weak edge removal property.

ACKNOWLEDGEMENTS
The authors would like to thank Vincent Y. F. Tan, Michelle Effros, and Silas L. Fong for helpful discussions and feedback.

APPENDIX A
PROOF OF PROPOSITION 1
We first show that $C(\mathcal{N}, (\epsilon_n)_n) \subseteq C(\mathcal{N}, (\tilde\epsilon_n)_n)$; the opposite direction follows by reversing the roles of $\epsilon_n$ and $\tilde\epsilon_n$. Fix any rate vector
$$\mathbf{R} \in \bigcup_{n_0 \in \mathbb{N}} \bigcap_{n \ge n_0} \mathcal{R}(\mathcal{N}, n, \epsilon_n). \qquad (235)$$
We aim to show that $\mathbf{R} \in C(\mathcal{N}, (\tilde\epsilon_n)_n)$. There exists $n_0 \in \mathbb{N}$ such that for all $n \ge n_0$, $\mathbf{R} \in \mathcal{R}(\mathcal{N}, n, \epsilon_n)$. By the assumption of the proposition, there exists a subsequence $n_i$ such that
$$\lim_{i \to \infty} -\frac{1}{n_i} \log(1 - \epsilon_{n_i}) = \alpha. \qquad (236)$$
For sufficiently large $i$, we have $n_i \ge n_0$, so $\mathbf{R} \in \mathcal{R}(\mathcal{N}, n_i, \epsilon_{n_i})$. That is, there exists an $n_i$-length code with rate $\mathbf{R}$ and probability of error at most $\epsilon_{n_i}$. Fix integer $N$, and form a new code on network $\mathcal{N}$ of length $n_i N$ and rate $\frac{N-2}{N}\mathbf{R}$ as follows. Roughly, reduce the overall probability of error by repeating the original code $N$ times, and introducing a small amount of error correction in the form of an outer maximum distance separable (MDS) code [49, Chap. 4]. In particular, for each node $v \in [1:d]$ where $R_v > 0$, form an $(N, N-2)$ MDS code on symbols from the finite field of order $2^{\lfloor n_i R_v \rfloor}$. This code exists for sufficiently large $i$ (e.g., a Reed–Solomon code [49, Chap. 5]). Let the MDS codeword be denoted by $(W_v(1), \ldots, W_v(N))$. Repeat the original code $N$ times, where on the $\ell$th repetition $W_v(\ell)$ is treated as the message originating at node $v$. Because each outer code is MDS, one error can be corrected, so if at most one of the $N$ repetitions results in an error, the full code will decode correctly. Because the network is memoryless and stationary, each repetition is independent and results in error with probability $\epsilon_{n_i}$, so the probability of error for the full code is given by
$$P_e = 1 - (1-\epsilon_{n_i})^N - N\epsilon_{n_i}(1-\epsilon_{n_i})^{N-1} \qquad (237)$$
$$= 1 - (1-\epsilon_{n_i})^{N-1}\big[ 1 - \epsilon_{n_i} + N\epsilon_{n_i} \big]. \qquad (238)$$
Note that (236) and the assumption that $\alpha > 0$ imply that $\epsilon_{n_i} \to 1$, meaning $1 - \epsilon_{n_i} + N\epsilon_{n_i} \to N$. Thus
$$\lim_{i \to \infty} \frac{1}{n_i} \log(1 - P_e) = \lim_{i \to \infty} \frac{1}{n_i} \big[ (N-1)\log(1-\epsilon_{n_i}) + \log N \big] \qquad (239)$$
$$= -(N-1)\alpha. \qquad (240)$$
In particular, for sufficiently large $i$, we have
$$1 - P_e \ge \exp\{ -n_i (N - 1/2)\alpha \}. \qquad (241)$$
Hence, for any $N$ and sufficiently large $i$,
$$\frac{N-2}{N}\mathbf{R} \in \mathcal{R}\big(\mathcal{N}, n_i N, 1 - \exp\{ -n_i (N-1/2)\alpha \}\big). \qquad (242)$$
Consider any blocklength $m$ where $n_i N \le m \le n_i(N+1)$. We may convert a code with blocklength $n_i N$ to one with blocklength $m$ simply by ignoring the additional $m - n_i N$ symbols. This reduces the rate by a factor of $\frac{n_i N}{m} \ge \frac{N}{N+1}$, but does not change the probability of error. Thus we have
$$\frac{N-2}{N+1}\mathbf{R} \in \mathcal{R}\big(\mathcal{N}, m, 1 - \exp\{ -n_i(N-1/2)\alpha \}\big). \qquad (243)$$
By the liminf assumption on $\tilde\epsilon_n$ in (13), for sufficiently large $m$ we have
$$-\frac{1}{m}\log(1 - \tilde\epsilon_m) \ge \frac{N - 1/2}{N}\,\alpha. \qquad (244)$$
Thus, if $m \ge n_i N$, we have
$$\tilde\epsilon_m \ge 1 - \exp\left\{ -m\, \frac{N-1/2}{N}\,\alpha \right\} \qquad (245)$$
$$\ge 1 - \exp\{ -n_i(N-1/2)\alpha \} \qquad (246)$$
where (245) holds by (244) for sufficiently large $i$. Hence, for any $N$, for all $m$ sufficiently large we have
$$\frac{N-2}{N+1}\mathbf{R} \in \mathcal{R}(\mathcal{N}, m, \tilde\epsilon_m). \qquad (247)$$
Thus
$$\frac{N-2}{N+1}\mathbf{R} \in C(\mathcal{N}, (\tilde\epsilon_n)_n). \qquad (248)$$
Since (248) holds for all $N$, and $C(\mathcal{N}, (\tilde\epsilon_n)_n)$ is closed, we have $\mathbf{R} \in C(\mathcal{N}, (\tilde\epsilon_n)_n)$. Note that both $i$ and $N$ must go to infinity, but $i$ converges to infinity first for fixed $N$ in (240).

APPENDIX B
PROOF OF PROPOSITION 2

Extremely strong converse ⇔ (1b): By taking $\gamma = K\alpha$, the extremely strong converse holds if and only if, for any $\alpha \ge 0$,
$$C(\mathcal{N}, (1 - 2^{-n\alpha})_n) \subseteq C(\mathcal{N}, 0^+) + [0, K\alpha]^d. \qquad (249)$$
By Proposition 1, $C(\mathcal{N}, (\epsilon_n)_n) = C(\mathcal{N}, (1 - 2^{-n\alpha})_n)$ if $1 - \epsilon_n \doteq 2^{-n\alpha}$. This proves that the extremely strong converse is equivalent to the condition in (1b).

(1a) ⇒ (1b): Consider any $\epsilon_n$ where $1 - \epsilon_n \doteq 2^{-n\alpha}$, and any $\mathbf{R} \in C(\mathcal{N}, (\epsilon_n)_n)$. If $\mathbf{R} \in C(\mathcal{N}, 0^+)$, then obviously $\mathbf{R} \in C(\mathcal{N}, 0^+) + [0, K\alpha]^d$. If $\mathbf{R} \notin C(\mathcal{N}, 0^+)$, then by condition (1a) we have $\alpha \ge \beta/K$, and $\mathbf{R} \in C(\mathcal{N}, 0^+) + [0, \beta]^d$. Thus $\mathbf{R} \in C(\mathcal{N}, 0^+) + [0, K\alpha]^d$. This proves (1b).

(1b) ⇒ (1a): Consider any $\mathbf{R} \notin C(\mathcal{N}, 0^+)$, and any sequence of $(\mathbf{R}, n)$ codes with probability of error $\epsilon_n$. By Proposition 1, this implies $\mathbf{R} \in C(\mathcal{N}, (1 - 2^{-n\alpha})_n)$, where
$$\alpha = \liminf_{n \to \infty} -\frac{1}{n}\log(1 - \epsilon_n). \qquad (250)$$
Hence, by condition (1b), $\mathbf{R} \in C(\mathcal{N}, 0^+) + [0, K\alpha]^d$. If $\beta$ is the smallest number such that $\mathbf{R} \in C(\mathcal{N}, 0^+) + [0, \beta]^d$, then we have $\beta \le K\alpha$. This proves (17), and hence (1a).

Exponentially strong converse ⇒ (2b): Let $\epsilon_n$ be a sequence where $-\log(1-\epsilon_n) = o(n)$. By the exponentially strong converse, for any $\gamma > 0$ there exists $\epsilon_n'$ where $-\log(1-\epsilon_n') = \Theta(n)$ such that (16) holds. For sufficiently large $n$, $-\log(1-\epsilon_n) \le -\log(1-\epsilon_n')$, meaning $\epsilon_n \le \epsilon_n'$. Thus
$$C(\mathcal{N}, (\epsilon_n)_n) \subseteq C(\mathcal{N}, (\epsilon_n')_n) \subseteq C(\mathcal{N}, 0^+) + [0, \gamma]^d. \qquad (251)$$
As this holds for all $\gamma > 0$, we have $C(\mathcal{N}, (\epsilon_n)_n) \subseteq C(\mathcal{N}, 0^+)$. This proves condition (2b).

(2b) ⇒ Exponentially strong converse:
Specifically, we prove that if the exponentially strong converse does not hold, then condition (2b) does not hold. Suppose there exists $\gamma > 0$ such that for all $\epsilon_n$ where $-\log(1-\epsilon_n) = \Theta(n)$, $C(\mathcal{N}, (\epsilon_n)_n) \not\subseteq C(\mathcal{N}, 0^+) + [0, \gamma]^d$. In particular, for any integer $r$, $C(\mathcal{N}, (1 - \exp\{-n/r\})_n) \not\subseteq C(\mathcal{N}, 0^+) + [0, \gamma]^d$. Since the sets $C(\mathcal{N}, (1 - \exp\{-n/r\})_n)$ are nested (decreasing as $r$ grows), there exists $\mathbf{R}$ in the interior of $C(\mathcal{N}, (1 - \exp\{-n/r\})_n)$ for all integers $r$ such that $\mathbf{R} \notin C(\mathcal{N}, 0^+)$. For all $r$, there exists $n_0(r)$ such that for all $n \ge n_0(r)$,
$$\mathbf{R} \in \mathcal{R}(\mathcal{N}, n, 1 - \exp\{-n/r\}). \qquad (252)$$
Define a sequence
$$\epsilon_n = \min_{r :\, n \ge n_0(r)} \big( 1 - \exp\{-n/r\} \big). \qquad (253)$$
Note that $-\log(1-\epsilon_n) \le n/r$ for $n \ge n_0(r)$, so $-\log(1-\epsilon_n) = o(n)$. Moreover, for any $n$, there is some $r$ such that $n \ge n_0(r)$ and $\epsilon_n = 1 - \exp\{-n/r\}$, so by (252), $\mathbf{R} \in \mathcal{R}(\mathcal{N}, n, \epsilon_n)$ for all $n$. Thus $\mathbf{R} \in C(\mathcal{N}, (\epsilon_n)_n)$. But since $\mathbf{R} \notin C(\mathcal{N}, 0^+)$, (2b) does not hold.

(2a) ⇒ (2b): By (2a), for any $\mathbf{R} \notin C(\mathcal{N}, 0^+)$, the probability of correct decoding must vanish exponentially fast, so $\mathbf{R} \notin C(\mathcal{N}, (\epsilon_n)_n)$ for any sequence $\epsilon_n$ such that $-\log(1-\epsilon_n) = o(n)$. Therefore $C(\mathcal{N}, (\epsilon_n)_n) \subseteq C(\mathcal{N}, 0^+)$, which proves (2b).

(2b) ⇒ (2a): For any $\mathbf{R} \notin C(\mathcal{N}, 0^+)$ and any sequence $\epsilon_n$ for which $\mathbf{R} \in C(\mathcal{N}, (\epsilon_n)_n)$, it cannot be that $-\log(1-\epsilon_n) = o(n)$, or else by (2b) we would have $\mathbf{R} \in C(\mathcal{N}, 0^+)$. Therefore $\epsilon_n$ must approach 1 exponentially fast, which proves (2a).

Strong converse ⇒ (3b): Note that the condition in the definition of the strong converse that $-\log(1-\epsilon_n) \to \infty$ can be more simply written as $\epsilon_n \to 1$. Consider any $\epsilon \in (0,1)$. By the strong converse, for any $\gamma > 0$, there exists a sequence $\epsilon_n \to 1$ where $C(\mathcal{N}, (\epsilon_n)_n) \subseteq C(\mathcal{N}, 0^+) + [0, \gamma]^d$. Noting that $\epsilon \le \epsilon_n$ for sufficiently large $n$, we have $C(\mathcal{N}, (\epsilon)_n) \subseteq C(\mathcal{N}, (\epsilon_n)_n) \subseteq C(\mathcal{N}, 0^+) + [0, \gamma]^d$. As this holds for all $\gamma > 0$, we have $C(\mathcal{N}, (\epsilon)_n) = C(\mathcal{N}, 0^+)$, which proves (3b).

(3b) ⇒ (3c): By (3b), for any integer $r$, $C(\mathcal{N}, (1 - 1/r)_n) = C(\mathcal{N}, 0^+)$. In particular, there exists $n_0(r)$ such that for all $n \ge n_0(r)$,
$$\mathcal{R}\left(\mathcal{N}, n, 1 - \frac{1}{r}\right) \subseteq C(\mathcal{N}, 0^+) + \left[0, \frac{1}{r}\right]^d. \qquad (254)$$
Define a sequence
$$\epsilon_n = \sup_{r :\, n \ge n_0(r)} \left( 1 - \frac{1}{r} \right). \qquad (255)$$
Certainly $\epsilon_n \ge 1 - 1/r$ for $n \ge n_0(r)$, meaning $\epsilon_n \to 1$. Moreover, if $n, r$ are such that $\epsilon_n = 1 - \frac{1}{r}$, then
$$\mathcal{R}(\mathcal{N}, n, \epsilon_n) = \mathcal{R}\left(\mathcal{N}, n, 1 - \frac{1}{r}\right) \subseteq C(\mathcal{N}, 0^+) + \left[0, \frac{1}{r}\right]^d = C(\mathcal{N}, 0^+) + [0, 1 - \epsilon_n]^d. \qquad (256)$$
Since $1 - \epsilon_n \to 0$, we have
$$C(\mathcal{N}, (\epsilon_n)_n) = C(\mathcal{N}, 0^+). \qquad (257)$$
This proves (3c).

(3c) ⇒ Strong converse:
By (3c), there exists a sequence $\epsilon_n \to 1$ where $C(\mathcal{N}, (\epsilon_n)_n) = C(\mathcal{N}, 0^+) \subseteq C(\mathcal{N}, 0^+) + [0, \gamma]^d$ for all $\gamma > 0$. This proves the strong converse.

(3c) ⇒ (3a): By (3c), there exists $\epsilon_n \to 1$ where $\mathbf{R} \notin C(\mathcal{N}, (\epsilon_n)_n)$ for any $\mathbf{R} \notin C(\mathcal{N}, 0^+)$. This implies that any sequence of $(\mathbf{R}, n)$ codes must have probability of error exceeding $\epsilon_n$ for sufficiently large $n$, so the probability of error must approach 1, which proves (3a).

(3a) ⇒ (3b): For any $\epsilon \in (0,1)$, by (3a) any $\mathbf{R} \notin C(\mathcal{N}, 0^+)$ has probability of error approaching 1, so $\mathbf{R} \notin C(\mathcal{N}, (\epsilon)_n)$. Therefore, $C(\mathcal{N}, (\epsilon)_n) = C(\mathcal{N}, 0^+)$, which proves (3b).

APPENDIX C
PROOF OF PROPOSITION 3

For any $Q_{X,Y}$, we may write
$$D(Q_{Y|X} \| P_{Y|X} \mid Q_X) = \sum_{x,y} Q_{X,Y}(x,y) \log \frac{Q_{Y|X}(y|x)}{P_{Y|X}(y|x)} \qquad (258)$$
$$= \sum_{x,y} Q_{X,Y}(x,y) \left[ \log \frac{Q_{Y|X}(y|x)}{Q_Y(y)} - \log \frac{P_{Y|X}(y|x)}{P_Y(y)} + \log \frac{Q_Y(y)}{P_Y(y)} \right] \qquad (259)$$
$$= I_{Q_{X,Y}}(X;Y) - \sum_{x,y} Q_{X,Y}(x,y) \log \frac{P_{Y|X}(y|x)}{P_Y(y)} + D(Q_Y \| P_Y) \qquad (260)$$
$$\ge I_{Q_{X,Y}}(X;Y) - C \qquad (261)$$
where (261) follows from (19), and the fact that relative entropy is non-negative. Thus, we may lower bound $\alpha(R)$ by
$$\alpha(R) \ge \min_{Q_{X,Y}} \big[ I_{Q_{X,Y}}(X;Y) - C + |R - I_{Q_{X,Y}}(X;Y)|^+ \big] \qquad (262)$$
$$\ge R - C \qquad (263)$$
where (263) holds because $x + |y - x|^+ \ge y$ for any real numbers $x, y$. This lower bound is achievable by setting $Q_{X,Y} = P_X \times P_{Y|X}$, where $P_X$ is any capacity-achieving input distribution, so indeed $\alpha(R) = R - C$.

Now consider a channel where (19) does not hold. That is, there exists some $x_0, y_0$ where
$$\log \frac{P_{Y|X}(y_0|x_0)}{P_Y(y_0)} > C. \qquad (264)$$
Let $P_X$ be any capacity-achieving input distribution. Thus,
$$\sum_{x,y} P_X(x) P_{Y|X}(y|x) \log \frac{P_{Y|X}(y|x)}{P_Y(y)} = C. \qquad (265)$$
In particular, there exists some $x_1, y_1$ where
$$\log \frac{P_{Y|X}(y_1|x_1)}{P_Y(y_1)} \le C \qquad (266)$$
and $P_X(x_1) P_{Y|X}(y_1|x_1) > 0$. For parameter $\lambda \ge 0$, define a joint distribution $Q^{(\lambda)}_{X,Y}$ where
$$Q^{(\lambda)}_{X,Y}(x,y) = P_X(x) P_{Y|X}(y|x) + \lambda 1(x = x_0, y = y_0) - \lambda 1(x = x_1, y = y_1). \qquad (267)$$
As long as $0 \le \lambda \le P_X(x_1) P_{Y|X}(y_1|x_1)$, this is a valid distribution. If we marginalize out $X$, we see that
$$Q^{(\lambda)}_Y(y) = P_Y(y) + \lambda 1(y = y_0) - \lambda 1(y = y_1). \qquad (268)$$
By [51, Lemma 17.3.3], the first term in the Taylor expansion for $D(Q^{(\lambda)}_Y \| P_Y)$ around $\lambda = 0$ is proportional to
$$\sum_y \frac{\big(Q^{(\lambda)}_Y(y) - P_Y(y)\big)^2}{P_Y(y)} = \lambda^2 \left( \frac{1}{P_Y(y_0)} + \frac{1}{P_Y(y_1)} \right). \qquad (269)$$
By [50, Cor. 1 in Sec. 4.5], $P_Y(y) > 0$ for all $y$ that are reachable from some input symbol. Note that (264) implies that $P_{Y|X}(y_0|x_0) > 0$, and also by assumption $P_{Y|X}(y_1|x_1) > 0$. That is, both $y_0$ and $y_1$ are reachable output symbols, so $P_Y(y_0), P_Y(y_1) > 0$. Thus in (269) the coefficient on $\lambda^2$ is finite, and so
$$\frac{d}{d\lambda} D(Q^{(\lambda)}_Y \| P_Y) \Big|_{\lambda=0} = 0. \qquad (270)$$
Noting that
$$\frac{\partial}{\partial Q_{XY}(x,y)} I_{Q_{XY}}(X;Y) = \log \frac{Q_{Y|X}(y|x)}{Q_Y(y)} - \log e \qquad (271)$$
we have
$$\zeta := \frac{d}{d\lambda} I_{Q^{(\lambda)}_{X,Y}}(X;Y) \Big|_{\lambda=0} = \log \frac{P_{Y|X}(y_0|x_0)}{P_Y(y_0)} - \log \frac{P_{Y|X}(y_1|x_1)}{P_Y(y_1)} > 0 \qquad (272)$$
where we have used the assumptions in (264) and (266). Applying the derivation in (258)–(260), we have
$$\frac{d}{d\lambda} D(Q^{(\lambda)}_{Y|X} \| P_{Y|X} \mid Q^{(\lambda)}_X) \Big|_{\lambda=0} \qquad (273)$$
$$= \frac{d}{d\lambda} \left[ I_{Q^{(\lambda)}_{X,Y}}(X;Y) - \sum_{x,y} Q^{(\lambda)}_{X,Y}(x,y) \log \frac{P_{Y|X}(y|x)}{P_Y(y)} + D(Q^{(\lambda)}_Y \| P_Y) \right]_{\lambda=0} \qquad (274)$$
$$= 0 \qquad (275)$$
where we have used (270), (272), and the fact that $\zeta$ is also the derivative of the second term in (274).
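The vanishing derivative in (270) is easy to confirm numerically. The Python snippet below (an illustrative check with an arbitrary toy output distribution of our choosing) shows that $D(Q_Y^{(\lambda)} \| P_Y)$ scales as $\lambda^2$, so $D/\lambda \to 0$ while $D/\lambda^2$ approaches a finite constant, matching (269)–(270).

```python
# Numeric check of (269)-(270): under the perturbation Q_Y = P_Y +
# lam*(1{y=y0} - 1{y=y1}), the divergence D(Q_Y || P_Y) is O(lam^2), so
# its derivative at lam = 0 vanishes. Toy distribution chosen arbitrarily.
from math import log2

P_Y = [0.5, 0.3, 0.2]
y0, y1 = 0, 1

def kl(q, p):
    return sum(qi * log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

for lam in [1e-1, 1e-2, 1e-3, 1e-4]:
    q = list(P_Y)
    q[y0] += lam
    q[y1] -= lam
    print(f"lam={lam:.0e}: D/lam = {kl(q, P_Y) / lam:.6f}  "
          f"D/lam^2 = {kl(q, P_Y) / lam**2:.4f}")
```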
Given $\lambda$ small enough so that $Q^{(\lambda)}_{X,Y}$ is a valid distribution, we may upper bound
$$\alpha(C + \zeta\lambda) \le D(Q^{(\lambda)}_{Y|X} \| P_{Y|X} \mid Q^{(\lambda)}_X) + |C + \zeta\lambda - I_{Q^{(\lambda)}_{X,Y}}(X;Y)|^+. \qquad (276)$$
Thus,
$$\frac{d\alpha(R)}{dR} \Big|_{R=C} = \lim_{\lambda \to 0} \frac{\alpha(C + \zeta\lambda)}{\zeta\lambda} \qquad (277)$$
$$\le \lim_{\lambda \to 0} \frac{1}{\zeta\lambda} \Big[ D(Q^{(\lambda)}_{Y|X} \| P_{Y|X} \mid Q^{(\lambda)}_X) + |C + \zeta\lambda - I_{Q^{(\lambda)}_{X,Y}}(X;Y)|^+ \Big] \qquad (278)$$
$$= \frac{1}{\zeta} \frac{d}{d\lambda} D(Q^{(\lambda)}_{Y|X} \| P_{Y|X} \mid Q^{(\lambda)}_X) \Big|_{\lambda=0} + \left| 1 - \frac{1}{\zeta} \frac{d}{d\lambda} I_{Q^{(\lambda)}_{X,Y}}(X;Y) \Big|_{\lambda=0} \right|^+ \qquad (279)$$
$$= 0 \qquad (280)$$
where in (279) we have used the fact that $Q^{(0)}_{X,Y} = P_X \times P_{Y|X}$, so $I_{Q^{(0)}_{X,Y}}(X;Y) = C$; and (280) follows from the definition of $\zeta$ in (272), as well as (275). Note also that this derivation is valid only because $\zeta > 0$, as shown in (272). Since $\alpha(R)$ is non-decreasing in $R$, we must have $\frac{d\alpha(R)}{dR}\big|_{R=C} = 0$.

APPENDIX D
PROOF OF PROPOSITION 4

Suppose the weak edge removal property holds; that is, for any $\gamma > 0$, there exists a sequence $k_n = \Theta(n)$ satisfying (22). Let
$$\delta' = \liminf_{n \to \infty} \frac{k_n}{n}. \qquad (281)$$
Note that $\delta' > 0$, and so for any $0 < \delta < \delta'$, we have $\delta n \le k_n$ for sufficiently large $n$. Thus
$$C(\mathcal{N}, 0^+, (\delta n)_n) \subseteq C(\mathcal{N}, 0^+, (k_n)_n) \subseteq C(\mathcal{N}, 0^+) + [0, \gamma]^d. \qquad (282)$$
Hence, the LHS of (24) is contained in $C(\mathcal{N}, 0^+) + [0, \gamma]^d$. Since this holds for all $\gamma > 0$, this proves (24).

Now we show that (24) implies the weak edge removal property. For any $\gamma > 0$, by (24) there exists $\delta > 0$ such that $C(\mathcal{N}, 0^+, (\delta n)_n) \subseteq C(\mathcal{N}, 0^+) + [0, \gamma]^d$. Thus, setting $k_n = \delta n$ satisfies (22). This proves the weak edge removal property.

To prove that the weak edge removal property is also equivalent to (25), we will show that
$$\bigcup_{k_n = o(n)} C(\mathcal{N}, 0^+, (k_n)_n) = \bigcap_{\delta > 0} C(\mathcal{N}, 0^+, (\delta n)_n). \qquad (283)$$
To show $\subseteq$ in (283), we need to show that for all $k_n = o(n)$, $C(\mathcal{N}, 0^+, (k_n)_n)$ is contained in the RHS of (283), or that $C(\mathcal{N}, 0^+, (k_n)_n) \subseteq C(\mathcal{N}, 0^+, (\delta n)_n)$ for all $\delta > 0$. Indeed this holds because for any $k_n = o(n)$ and any $\delta > 0$, $k_n \le \delta n$ for sufficiently large $n$. To show $\supseteq$ in (283), let $\mathbf{R}$ be in the RHS of (283). Thus, for all $\epsilon, \delta, \gamma > 0$, for sufficiently large $n$ we have $\mathbf{R} \in \mathcal{R}(\mathcal{N}, n, \epsilon, n\delta) + [0, \gamma]^d$. In particular, for any fixed integer $r$, we may let $\epsilon = \delta = \gamma = 1/r$, so there exists $n_0(r)$ such that for all $n \ge n_0(r)$ we have
$$\mathbf{R} \in \mathcal{R}\left(\mathcal{N}, n, \frac{1}{r}, \frac{n}{r}\right) + \left[0, \frac{1}{r}\right]^d. \qquad (284)$$
Let
$$r_n = \max\{ r : n_0(r) \le n \}. \qquad (285)$$
By (284), for any $n$ we have
$$\mathbf{R} \in \mathcal{R}\left(\mathcal{N}, n, \frac{1}{r_n}, \frac{n}{r_n}\right) + \left[0, \frac{1}{r_n}\right]^d. \qquad (286)$$
Letting $k_n = n/r_n$, we may rewrite (286) as
$$\mathbf{R} \in \mathcal{R}\left(\mathcal{N}, n, \frac{k_n}{n}, k_n\right) + \left[0, \frac{k_n}{n}\right]^d. \qquad (287)$$
Note that for any integer $r$, if $n \ge n_0(r)$, then $r_n \ge r$, so $k_n \le n/r$. Thus $k_n/n \to 0$; i.e., $k_n = o(n)$. From (287), we have $\mathbf{R} \in C(\mathcal{N}, 0^+, (k_n)_n)$. This proves $\supseteq$ in (283).

We now prove statement 3. Note that the very weak edge removal property is equivalent to the statement that for all $\gamma > 0$,
$$\bigcap_{k_n : k_n \to \infty} C(\mathcal{N}, 0^+, (k_n)_n) \subseteq C(\mathcal{N}, 0^+) + [0, \gamma]^d. \qquad (288)$$
This is easily seen to be equivalent to (26). To show that the very weak edge removal property is also equivalent to (27), we show that
$$\bigcap_{k_n : k_n \to \infty} C(\mathcal{N}, 0^+, (k_n)_n) = \bigcap_{\epsilon > 0} \overline{\bigcup_{k \in \mathbb{N}} C(\mathcal{N}, (\epsilon)_n, (k)_n)}. \qquad (289)$$
Noting that
$$\bigcap_{k_n : k_n \to \infty} C(\mathcal{N}, 0^+, (k_n)_n) = \bigcap_{k_n : k_n \to \infty}\; \bigcap_{\epsilon > 0} C(\mathcal{N}, (\epsilon)_n, (k_n)_n) = \bigcap_{\epsilon > 0}\; \bigcap_{k_n : k_n \to \infty} C(\mathcal{N}, (\epsilon)_n, (k_n)_n) \qquad (290)$$
it is enough to show that for all $\epsilon > 0$,
$$\bigcap_{k_n : k_n \to \infty} C(\mathcal{N}, (\epsilon)_n, (k_n)_n) = \overline{\bigcup_{k \in \mathbb{N}} C(\mathcal{N}, (\epsilon)_n, (k)_n)}. \qquad (291)$$
For any $k \in \mathbb{N}$ and any sequence $k_n \to \infty$, $k \le k_n$ for sufficiently large $n$. Thus
$$\bigcap_{k_n : k_n \to \infty} C(\mathcal{N}, (\epsilon)_n, (k_n)_n) \supseteq \bigcup_{k \in \mathbb{N}} C(\mathcal{N}, (\epsilon)_n, (k)_n). \qquad (292)$$
Taking a closure yields $\supseteq$ in (291), since the LHS of (291) is already closed. To prove the opposite direction, let $\gamma_k$ be a positive sequence where $\lim_{k \to \infty} \gamma_k = 0$. For fixed $\epsilon \in (0,1)$ and $k \in \mathbb{N}$, by the definition of $C(\mathcal{N}, (\epsilon)_n, (k)_n)$ in (21), there exists $n_0(k)$ such that for all $n \ge n_0(k)$, we have
$$\mathcal{R}(\mathcal{N}, n, \epsilon, k) \subseteq C(\mathcal{N}, (\epsilon)_n, (k)_n) + [0, \gamma_k]^d. \qquad (293)$$
Now define a sequence
$$k_n = \max\{ k : n \ge n_0(k) \}. \qquad (294)$$
Note that for any $k \in \mathbb{N}$, $k_n \ge k$ for all $n \ge n_0(k)$, so $k_n \to \infty$ as $n \to \infty$. Thus the LHS of (291) is contained in $C(\mathcal{N}, (\epsilon)_n, (k_n)_n)$. Moreover,
$$C(\mathcal{N}, (\epsilon)_n, (k_n)_n) = \bigcup_{n_0 \in \mathbb{N}}\; \bigcap_{n' \ge n_0} \mathcal{R}(\mathcal{N}, n', \epsilon, k_{n'}) \qquad (295)$$
$$\subseteq \bigcup_{n_0 \in \mathbb{N}}\; \bigcap_{n' \ge n_0} \big( C(\mathcal{N}, (\epsilon)_n, (k_{n'})_n) + [0, \gamma_{k_{n'}}]^d \big) \qquad (296)$$
$$= \bigcup_{n_0 \in \mathbb{N}}\; \bigcap_{n' \ge n_0} C(\mathcal{N}, (\epsilon)_n, (k_{n'})_n) \qquad (297)$$
$$\subseteq \overline{\bigcup_{k \in \mathbb{N}} C(\mathcal{N}, (\epsilon)_n, (k)_n)} \qquad (298)$$
where (295) holds by definition, (296) follows from (293), (297) holds because $\gamma_k \to 0$, and (298) holds because for any $n'$, $k_{n'}$ is some integer. This proves $\subseteq$ in (291).

We now prove statement 4. The definition of the extremely weak edge removal property may be equivalently written
$$\bigcup_{\text{bounded } k_n} C(\mathcal{N}, 0^+, (k_n)_n) \subseteq \bigcap_{\gamma > 0} \big( C(\mathcal{N}, 0^+) + [0, \gamma]^d \big). \qquad (299)$$
Note that for any bounded $k_n$, $C(\mathcal{N}, 0^+, (k_n)_n) \subseteq C(\mathcal{N}, 0^+, (k)_n)$ for some constant integer $k$. Thus the LHS of (299) can be written
$$\bigcup_{k \in \mathbb{N}} C(\mathcal{N}, 0^+, (k)_n). \qquad (300)$$
Moreover, the RHS of (299) is simply $C(\mathcal{N}, 0^+)$. Therefore the extremely weak edge removal property is equivalent to (28).

APPENDIX E
PROOF OF THEOREM 14
A significant technical tool in proving network equivalence (cf. the discussion in Sec. VI, and the original result in [35]) is the idea of channel simulation, in which a point-to-point channel is accurately simulated by any other with higher capacity. This idea was at the heart of the proof in [35]. A version of this idea was stated in [53] as the universal channel simulation lemma, given as follows. The lemma states that two nodes with shared randomness (represented by $U$) can use a noiseless link to accurately simulate a noisy channel, as long as the capacity of the noiseless link is greater than the capacity of the noisy channel. While [53] did not provide a proof, we presented a proof in the appendix of [54].
Lemma 16: Let $(\mathcal{X}, Q_{Y|X}, \mathcal{Y})$ be a discrete memoryless channel with capacity $C$. Given a rate $R > C$, a channel simulation code $(f, g)$ consists of
• $f : \mathcal{X}^n \times [0,1) \to \{0,1\}^{nR}$,
• $g : \{0,1\}^{nR} \times [0,1) \to \mathcal{Y}^n$.
Let $P_{Y^n|X^n}$ be the conditional pmf of $Y^n$ given $X^n$ where $U \sim \text{Unif}[0,1)$ and
$$Y^n = g(f(X^n, U), U). \qquad (301)$$
There exists a sequence of length-$n$ simulation codes where
$$\lim_{n \to \infty} \max_{x^n} d_{TV}\big(P_{Y^n \mid X^n = x^n}, Q_{Y^n \mid X^n = x^n}\big) = 0. \qquad (302)$$

We now proceed to prove Theorem 14. By Theorem 5, we only need to show that the very weak edge removal property implies the ordinary strong converse. The basic approach is to use network equivalence to convert a code for the noisy network $\mathcal{N}$ into a code on the noiseless version, then apply Lemma 9 on this noiseless network, and then again use network equivalence to convert back to the noisy network.

Let $E \subset [1:d] \times [1:d]$ be the set of pairs of nodes connected by point-to-point links. Recall that by assumption, the directed graph $([1:d], E)$ is acyclic. Thus, by [55, Prop. 19.1] we may assign each node $i$ a distinct integer $\pi_i \in [1:d]$ where $\pi_i < \pi_j$ if $(i,j) \in E$. For any $(i,j) \in E$, let $C_{i \to j}$ be the capacity of the link from $i$ to $j$. Assume without loss of generality that $C_{i \to j} > 0$ for all $(i,j) \in E$. Let $C_{\min} = \min_{(i,j) \in E} C_{i \to j}$, so in particular $C_{\min} > 0$. Denote $X_{i \to j}$ and $Y_{i \to j}$ as the input and output respectively of the link $(i,j)$. Thus the transmitted symbol from node $i$ can be written
$$X_i = (X_{i \to j} : (i,j) \in E) \qquad (303)$$
and the received symbol at node $j$ can be written
$$Y_j = (Y_{i \to j} : (i,j) \in E). \qquad (304)$$
Let $\mathbf{R}$ be achievable with respect to fixed $\epsilon \in (0,1)$. Thus, for sufficiently large $n$, there exists a length-$n$ code for network $\mathcal{N}$ with rate $\mathbf{R}$ and probability of error $\epsilon$. By (9)–(10), this code is defined by encoding functions $\phi_{it}$ for each node $i \in [1:d]$ and time $t \in [1:n]$, and decoding functions $\psi_i$ for each node $i \in [1:d]$. It will be useful to work with coding functions on $n$-length blocks rather than single time instances, so we define the block-wise encoding function at node $i$
$$\phi_i^n : [1:2^{nR_i}] \times \mathcal{Y}_i^n \to \mathcal{X}_i^n \qquad (305)$$
as
$$\phi_i^n(w_i, y_i^n) = \big(\phi_{i1}(w_i), \phi_{i2}(w_i, y_{i,1}), \ldots, \phi_{in}(w_i, y_i^{n-1})\big). \qquad (306)$$
Using the notation in (304), we may notate the arguments to this function as
$$\phi_i^n\big(w_i, (y_{k \to i}^n : (k,i) \in E)\big). \qquad (307)$$
Due to the network being acyclic, we may form a pipelined block-Markov version of this code as follows. Given integer $N$, we form a code with length $n(N+d)$ and rate $\frac{N}{N+d}\mathbf{R}$. The outer blocklength $N$ serves a similar function as it did for network stacking, but here it represents the number of message blocks transmitted sequentially, rather than the number of stacks. Note that message $i$ consists of $NnR_i$ bits, which we denote $W_i(1), \ldots, W_i(N)$, each consisting of $nR_i$ bits. We then pipeline $N$ copies of the original code, encoding $n$-length blocks at a time. In particular, we introduce notation
$$X_j^{n(N+d)} = (X_j^n(1), \ldots, X_j^n(N+d)), \qquad (308)$$
$$Y_{i \to j}^{n(N+d)} = (Y_{i \to j}^n(1), \ldots, Y_{i \to j}^n(N+d)). \qquad (309)$$
Now, we define the coding operations at node $j$ by, for all $\ell \in [1:N]$,
$$X_j^n(\ell + \pi_j) = \phi_j^n\big(W_j(\ell), (Y_{i \to j}^n(\ell + \pi_i) : (i,j) \in E)\big). \qquad (310)$$
Recall that if $(i,j) \in E$, then $\pi_i < \pi_j$, meaning that the arguments of $\phi_j^n$ in (310) are causally available.
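The schedule in (310) is easy to visualize. The Python sketch below (a toy line network of our own choosing, with $\pi$ the identity order; nothing here is from the paper) prints which message block each node encodes in each outer block; the staggered starts are exactly why $N+d$ outer blocks suffice for $N$ message blocks.

```python
# Toy visualization of the pipelined block-Markov schedule in (310): node j
# encodes message block l during outer block l + pi_j, where pi is a
# topological order, so upstream outputs Y(l + pi_i) are always available.
# Line network 1 -> 2 -> 3 with pi the identity order.
N, d = 4, 3
pi = {1: 1, 2: 2, 3: 3}

for block in range(1, N + d + 1):    # outer blocks, each of length n
    active = [(j, block - pi[j]) for j in pi
              if 1 <= block - pi[j] <= N]
    desc = ", ".join(f"node {j} encodes W_{j}(l={l})" for j, l in active)
    print(f"outer block {block}: {desc or 'idle / arbitrary inputs'}")

# Total length (N + d) outer blocks for N message blocks: rate N/(N+d) * R.
```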
Note that (310) does not specify all channel inputs, namely $X_j^n(\ell')$ for $\ell' \in [1:\pi_j] \cup [N + \pi_j + 1 : N + d]$; these channel inputs can be arbitrary, as the corresponding channel outputs will be ignored. To decode at node $i$, for all $\ell \in [1:N]$ let
$$(\hat{W}_{ji}(\ell) : i \in \mathcal{D}_j) = \psi_i\big(W_i(\ell), (Y_{k \to i}^n(\ell + \pi_k) : (k,i) \in E)\big). \qquad (311)$$
Observe that the variables associated with a given index $\ell \in [1:N]$ interact only with themselves, and behave exactly like the original $n$-length code. Thus, an error occurs on this pipelined code if and only if any of the $N$ copies makes an error, so the probability of error is
$$1 - (1 - \epsilon)^N. \qquad (312)$$
Thus we have
$$\frac{N}{N+d}\, \mathbf{R} \in \mathcal{R}\big(\mathcal{N}, n(N+d), 1 - (1-\epsilon)^N\big). \qquad (313)$$
Note that in this pipelined code, encoding operations are performed on $n$-length blocks at a time. Thus, the pipelined code on $\mathcal{N}$ can be converted to one on a deterministic network using channel simulation codes. In particular, fix $\Delta \in (0, C_{\min})$ and let $\bar{\mathcal{N}}_\Delta$ be the network of noiseless links where link $(i,j)$ is replaced by a noiseless link with capacity $C_{i \to j} + \Delta$. By Lemma 16, for each link $(i,j)$ there exists a channel simulation code for link $(i,j)$ of rate $C_{i \to j} + \Delta$ and total variational distance at most $d_n^{(i \to j)}$, where $d_n^{(i \to j)} \to 0$ as $n \to \infty$. For each link $(i,j) \in E$, we use $N$ copies of the associated channel simulation code to simulate the behavior of link $(i,j)$ in network $\mathcal{N}$ using the corresponding link on $\bar{\mathcal{N}}_\Delta$. We analyze the impact on the overall probability of error from replacing these noisy channels by channel simulation codes as follows. Let $P_{X,Y,W,\hat{W}}$ be the joint distribution of all channel inputs $X$, channel outputs $Y$, messages $W$, and message estimates $\hat{W}$ for the pipelined code on the noisy network $\mathcal{N}$. Similarly, let $Q_{X,Y,W,\hat{W}}$ be the joint distribution of the same random variables for the code on the noiseless network $\bar{\mathcal{N}}_\Delta$ constructed out of channel simulation codes. Note that in the latter, $X$ and $Y$ are not real channel inputs and outputs, but rather simulated inputs and outputs that feed into the channel simulation codes, used to simulate noisy links with noiseless links. Since each channel simulation code used on an $n$-length block for link $(i,j)$ results in total variational distance at most $d_n^{(i \to j)}$, we may bound
$$d_{TV}\big(P_{X,Y,W,\hat{W}}, Q_{X,Y,W,\hat{W}}\big) \le \sum_{(i,j) \in E} N d_n^{(i \to j)}. \qquad (314)$$
The probability of error for the code on the noiseless network $\bar{\mathcal{N}}_\Delta$ differs from that on the original noisy network by at most the quantity in (314). Because total variational distance is an upper bound on the difference in the probability of any event between the two distributions, the probability of error of the resulting code on $\bar{\mathcal{N}}_\Delta$ is at most
$$1 - (1-\epsilon)^N + \sum_{(i,j) \in E} N d_n^{(i \to j)} \le 1 - \frac{1}{2}(1-\epsilon)^N \qquad (315)$$
where the inequality holds for sufficiently large $n$, since each sequence $d_n^{(i \to j)}$ vanishes with $n$. Recall that the channel simulation codes described in Lemma 16 employ common randomness $U$ between the transmitter and receiver of each link. Thus, a direct application of Lemma 16 implies only the existence of a code achieving the probability in (315) if nodes are allowed common randomness. However, we may treat this common randomness as a randomized codebook, and employ a usual random coding argument to show that there exists at least one deterministic code achieving (315). Hence, for sufficiently large $n$,
$$\frac{N}{N+d}\, \mathbf{R} \in \mathcal{R}\left( \bar{\mathcal{N}}_\Delta, n(N+d), 1 - \frac{1}{2}(1-\epsilon)^N \right). \qquad (316)$$
We now apply Lemma 9 on $\bar{\mathcal{N}}_\Delta$, to find that for any $\tilde\epsilon > 0$ and for sufficiently large $n$, we have
$$\frac{N}{N+d}\, \mathbf{R} \in \mathcal{R}\big( \bar{\mathcal{N}}_\Delta, n(N+d), \tilde\epsilon, \eta(\tilde\epsilon, d) - dN \log(1-\epsilon) + 3d \big) \qquad (317)$$
where $\eta(\tilde\epsilon, d)$ is defined in (42). Let $\bar{\mathcal{N}}_{-\Delta}$ be the noiseless network where each link $(i,j)$ is replaced by a noiseless one with capacity $C_{i \to j} - \Delta$. By the assumption that $\Delta < C_{\min}$, we always have $C_{i \to j} - \Delta > 0$. We may convert the code on $\bar{\mathcal{N}}_\Delta$ to one on $\bar{\mathcal{N}}_{-\Delta}$ by stretching each block of length $n$ to one of length
$$n' = \frac{C_{\min} + \Delta}{C_{\min} - \Delta}\, n. \qquad (318)$$
Thus
$$\frac{N}{N+d} \cdot \frac{C_{\min} - \Delta}{C_{\min} + \Delta}\, \mathbf{R} \in \mathcal{R}\big( \bar{\mathcal{N}}_{-\Delta}, n'(N+d), \tilde\epsilon, \eta(\tilde\epsilon, d) - dN \log(1-\epsilon) + 3d \big). \qquad (319)$$
Now we use ordinary noisy channel codes to convert this code back to one on $\mathcal{N}$, again one block (now of length $n'$) at a time. For any $N$ and sufficiently large $n$, the probability of an error occurring on any of these channel codes can be made at most $\tilde\epsilon$. Thus we have
$$\frac{N}{N+d} \cdot \frac{C_{\min} - \Delta}{C_{\min} + \Delta}\, \mathbf{R} \in \mathcal{R}\big( \mathcal{N}, n'(N+d), 2\tilde\epsilon, \eta(\tilde\epsilon, d) - dN \log(1-\epsilon) + 3d \big). \qquad (320)$$
As the above holds for any $\tilde\epsilon > 0$, we may write
$$\frac{N}{N+d} \cdot \frac{C_{\min} - \Delta}{C_{\min} + \Delta}\, \mathbf{R} \in \bigcap_{\tilde\epsilon > 0} C\big( \mathcal{N}, (2\tilde\epsilon)_n, (\eta(\tilde\epsilon, d) - dN \log(1-\epsilon) + 3d)_n \big) \qquad (321)$$
$$\subseteq \bigcap_{\tilde\epsilon > 0} \overline{\bigcup_{k \in \mathbb{N}} C(\mathcal{N}, (\tilde\epsilon)_n, (k)_n)}. \qquad (322)$$
Since we may take $N$ to be arbitrarily large, and $\Delta$ arbitrarily small, and we chose $\mathbf{R}$ to be any achievable vector with respect to $\epsilon$, by closure we have
$$C(\mathcal{N}, (\epsilon)_n) \subseteq \bigcap_{\tilde\epsilon > 0} \overline{\bigcup_{k \in \mathbb{N}} C(\mathcal{N}, (\tilde\epsilon)_n, (k)_n)}. \qquad (323)$$
By the equivalent form of the very weak edge removal property in (27) of Proposition 4, if very weak edge removal holds, then the RHS of (323) equals $C(\mathcal{N}, 0^+)$, so the strong converse holds.

REFERENCES
[1] T. Ho, M. Effros, and S. Jalali, "On equivalence between network topologies," in Proc. Forty-Eighth Annual Allerton Conference, Monticello, IL, Oct. 2010.
[2] S. Jalali, M. Effros, and T. Ho, "On the impact of a single edge on the network coding capacity," in Proc. Information Theory and Applications Workshop (ITA), San Diego, CA, Feb. 2011, pp. 1–5.
[3] E. J. Lee, M. Langberg, and M. Effros, "Outer bounds and a functional study of the edge removal problem," in Proc. IEEE Information Theory Workshop, Sevilla, Spain, Sep. 2013, pp. 1–5.
[4] S. U. Kamath, D. N. C. Tse, and V. Anantharam, "Generalized network sharing outer bound and the two-unicast problem," in Proc. International Symposium on Network Coding (NetCod), Beijing, China, Jul. 2011.
[5] R. W. Yeung, "A framework for linear information inequalities," IEEE Trans. Inf. Theory, vol. 43, no. 6, pp. 1924–1934, Nov. 1997.
[6] M. Langberg and M. Effros, "Network coding: Is zero error always possible?" in Proc. Forty-Ninth Annual Allerton Conference, Monticello, IL, Sep. 2011, pp. 1–8.
[7] T. H. Chan and A. Grant, "Network coding capacity regions via entropy functions," IEEE Trans. Inf. Theory, vol. 60, no. 9, pp. 5347–5374, Sep. 2014.
[8] M. F. Wong, M. Langberg, and M. Effros, "On a capacity equivalence between network and index coding and the edge removal problem," in Proc. IEEE International Symposium on Information Theory, Jul. 2013, pp. 972–976.
[9] P. Noorzad, M. Effros, M. Langberg, and T. Ho, "On the power of cooperation: Can a little help a lot?" in Proc. IEEE International Symposium on Information Theory, Jun. 2014, pp. 3132–3136.
[10] P. Noorzad, M. Effros, and M. Langberg, "On the cost and benefit of cooperation," in Proc. IEEE International Symposium on Information Theory, Jun. 2015, pp. 36–40.
[11] ——, "Can negligible cooperation increase network reliability?" in Proc. IEEE International Symposium on Information Theory, Jul. 2016, pp. 1784–1788.
[12] ——, "The unbounded benefit of encoder cooperation for the k-user MAC," in Proc. IEEE International Symposium on Information Theory, Jul. 2016, pp. 340–344.
[13] ——, "Can negligible rate increase network reliability?" IEEE Trans. Inf. Theory, vol. 64, no. 6, pp. 4282–4293, Jun. 2018.
[14] ——, "The benefit of encoder cooperation in the presence of state information," 2017.
[15] M. Langberg and M. Effros, "On the capacity advantage of a single bit," Dec. 2016, pp. 1–6.
[16] W. Gu, "On achievable rate regions for source coding over networks," Ph.D. dissertation, California Institute of Technology, 2009.
[17] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Channel coding rate in the finite blocklength regime," IEEE Trans. Inf. Theory, vol. 56, pp. 2307–2359, 2010.
[18] J. Wolfowitz, "The coding of messages subject to chance errors," Illinois Journal of Mathematics, vol. 1, no. 4, pp. 591–606, 1957.
[19] A. Winter, "Coding theorem and strong converse for quantum channels," IEEE Trans. Inf. Theory, vol. 45, no. 7, pp. 2481–2485, Nov. 1999.
[20] T. Ogawa and H. Nagaoka, "Strong converse to the quantum channel coding theorem," IEEE Trans. Inf. Theory, vol. 45, no. 7, pp. 2486–2489, Nov. 1999.
[21] S. L. Fong and V. Y. F. Tan, "Strong converse theorems for discrete memoryless networks with tight cut-set bound," in Proc. IEEE International Symposium on Information Theory, Jun. 2017, pp. 933–937.
[22] S. Arimoto, "On the converse to the coding theorem for discrete memoryless channels (corresp.)," IEEE Trans. Inf. Theory, vol. 19, no. 3, pp. 357–359, May 1973.
[23] G. Dueck and J. Körner, "Reliability function of a discrete memoryless channel at rates above capacity (corresp.)," IEEE Trans. Inf. Theory, vol. 25, no. 1, pp. 82–85, Jan. 1979.
[24] Y. Oohama, "Strong converse exponent for degraded broadcast channels at rates outside the capacity region," in Proc. IEEE International Symposium on Information Theory, Jun. 2015, pp. 939–943.
[25] ——, "Exponent function for one helper source coding problem at rates outside the rate region," in Proc. IEEE International Symposium on Information Theory, Jun. 2015, pp. 1575–1579.
[26] ——, "Exponent function for asymmetric broadcast channels at rates outside the capacity region," Oct. 2016, pp. 537–541.
[27] ——, "Exponent function for Wyner–Ziv source coding problem at rates below the rate distortion function," Oct. 2016, pp. 171–175.
[28] K. Marton, "A simple proof of the blowing-up lemma (corresp.)," IEEE Trans. Inf. Theory, vol. 32, no. 3, pp. 445–446, May 1986.
[29] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, 2011.
Network Information Theory . Cambridge University Press, 2011.[30] W. Gu, M. Effros, and M. Bakshi, “A continuity theory for lossless source coding over networks,” in , Sept 2008, pp. 1527–1534.[31] M. Langberg and M. Effros, “Source coding for dependent sources,” in
Information Theory Workshop (ITW), 2012 IEEE ,Sept 2012, pp. 70–74.[32] R. Ahlswede, P. G´acs, and J. K¨orner, “Bounds on conditional probabilities with applications in multi-user communication,”
Z. Wahrscheinlichkeitstheorie verw. Gebiete , vol. 34, pp. 157–177, 1976.[33] M. Raginsky and I. Sason, “Concentration of measure inequalities in information theory, communications, and coding,”
Foundations and Trends in Communications and Information Theory , vol. 10, no. 1-2, pp. 1–246, 2013.[34] V. Strassen, “The existence of probability measures with given marginals,”
Ann. Math. Statist. , vol. 36, pp. 423–439, 1965.[35] R. Koetter, M. Effros, and M. M´edard, “A theory of network equivalence—Part I: Point-to-point channels,”
IEEE Trans.Inf. Theory , vol. 57, no. 2, pp. 972–995, 2011.[36] T. Cover and A. E. Gamal, “Capacity theorems for the relay channel,”
IEEE Trans. Inf. Theory , vol. 25, no. 5, pp. 572–584,September 1979.[37] A. El Gamal, “On information flow in relay networks,” in
Proc. IEEE National Telecomm. Conf. , vol. 2, New Orleans,LA, Nov. 1981, pp. D4.1.1–D4.1.4. [38] A. El Gamal and M. Aref, “The capacity of the semideterministic relay channel (corresp.),” IEEE Trans. Inf. Theory ,vol. 28, no. 3, pp. 536–536, May 1982.[39] A. S. Avestimehr, S. N. Diggavi, and D. N. C. Tse, “Wireless network information flow: A deterministic approach,”
IEEETrans. Inf. Theory , vol. 57, no. 4, pp. 1872–1905, April 2011.[40] Y. Oohama, “Strong converse theorems for degraded broadcast channels with feedback,” in , June 2015, pp. 2510–2514.[41] J. Liu, T. A. Courtade, P. Cuff, and S. Verd´u, “Smoothing Brascamp-Lieb inequalities and strong converses for commonrandomness generation,” in , July 2016, pp. 1043–1047.[42] R. Ahlswede and I. Csisz´ar, “Common randomness in information theory and cryptography. II. CR capacity,”
IEEE Trans.Inf. Theory , vol. 44, no. 1, pp. 225–240, Jan 1998.[43] H. Sato, “The capacity of the Gaussian interference channel under strong interference (corresp.),”
IEEE Trans. Inf. Theory ,vol. 27, no. 6, pp. 786–788, Nov 1981.[44] M. Costa and A. El Gamal, “The capacity region of the discrete memoryless interference channel with strong interference(corresp.),”
IEEE Trans. Inf. Theory , vol. 33, no. 5, pp. 710–711, Sep 1987.[45] S. Q. Le, V. Y. F. Tan, and M. Motani, “A case where interference does not affect the channel dispersion,”
IEEE Trans.Inf. Theory , vol. 61, no. 5, pp. 2439–2453, May 2015.[46] S. L. Fong and V. Y. F. Tan, “A proof of the strong converse theorem for Gaussian multiple access channels,”
IEEE Trans.Inf. Theory , vol. 62, no. 8, pp. 4376–4394, Aug 2016.[47] G. Dueck, “The strong converse to the coding theorem for the multiple-access channel,”
J. Combinat., Inf. Syst. Sci , vol. 6,no. 3, pp. 187–196, 1981.[48] R. Ahlswede, “An elementary proof of the strong converse theorem for the multiple access channel,”
J. Combinat., Inf.Syst. Sci. , vol. 7, no. 3, pp. 216–230, 1982.[49] R. Roth,
Introduction to Coding Theory . New York, NY, USA: Cambridge University Press, 2006.[50] R. G. Gallager,
Information Theory and Reliable Communication . New York, NY, USA: John Wiley & Sons, Inc., 1968.[51] T. M. Cover and J. Thomas,
Elements of Information Theory . John Wiley, 1991.[52] S. Borade and L. Zheng, “Euclidean information theory,” in ,March 2008, pp. 14–17.[53] Y. Xiang and Y.-H. Kim, “A few meta-theorems in network information theory,” in
Information Theory Workshop (ITW),2014 IEEE , Nov 2014, pp. 77–81.[54] O. Kosut and J. Kliewer, “Equivalence for networks with adversarial state,”
IEEE Trans. Inf. Theory , vol. 63, no. 7, pp.4137–4154, July 2017.[55] R. W. Yeung,