Tension Bounds for Information Complexity
Manoj M. Prabhakaran ∗ Vinod M. Prabhakaran † January 13, 2018
Abstract
The main contribution of this work is to relate information complexity to “tension” [PP14] – an information-theoretic quantity defined with no reference to protocols – and to illustrate that it allows deriving strong lower-bounds on information complexity. In particular, we use a very special case of this connection to give a quantitatively tighter connection between information complexity and discrepancy than the one in [BW12] (albeit, restricted to independent inputs). Further, as tension is in fact a multi-dimensional notion, it enables us to bound the 2-dimensional region that represents the trade-off between the amounts of communication in the two directions, in a 2-party protocol.

This work is also intended to highlight tension as a fundamental measure of correlation between a pair of random variables, with rich connections to a variety of questions in computer science and information theory.

∗ Department of Computer Science, University of Illinois, Urbana-Champaign. [email protected]
† School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai, India. [email protected]

Introduction
Communication complexity, since the seminal work of Yao [Yao79], has been a central question in theoretical computer science. Many of the recent advances in this area have centred around the notion of information complexity, which measures the amount of information about the inputs – rather than the number of bits – that should be present in a protocol’s transcript, if it should compute a function (somewhat) correctly.

The main contribution of this work is to relate information complexity to “tension” [PP14] – an information-theoretic quantity defined with no reference to protocols – and to illustrate that it allows deriving strong bounds on information complexity. In particular, we use a very special case of this connection to give a quantitatively tighter connection between information complexity and discrepancy than the one in [BW12] (albeit, restricted to independent inputs). Further, as tension is in fact a multi-dimensional notion, it enables us to bound the 2-dimensional region that represents the trade-off between the amounts of communication in the two directions, in a 2-party protocol.

This work is also intended to highlight tension as a fundamental measure of correlation between a pair of random variables, with rich connections to a variety of questions in computer science and information theory. Tension is intimately related to the notion of common information developed in highly influential works in the information theory literature from the 70’s [GK73, Wyn75]. Tension has proven useful in deriving state-of-the-art bounds on “cryptographic complexity” (i.e., the number of instances of, say, oblivious transfer needed per instance of securely computing a function) [PP14] and on the communication complexity of information-theoretically secure multiparty computation [DPP14]. However, currently we have few tools to compute (or bound) tension. We leave it as an important problem to understand tension in general as well as for specific random variables.
What is Tension?
Tension of a pair of correlated random variables (A; B) captures the “non-trivial” correlation between them: i.e., the extent to which the correlation cannot be captured by a common random variable that can be associated with both A and B. The question of how well correlation can be captured by a random variable is formulated in terms of “common information.” Two different notions of common information were developed in the 70’s, CI_GK(A; B) by Gács-Körner [GK73], and CI_Wyn(A; B) by Wyner [Wyn75], with operational meanings related to certain natural information theoretic problems. (See Appendix A for more details.) One can define corresponding notions of tension as the gap between mutual information (which accounts for all the correlation, but may not correspond to a common random variable) and common information. More precisely, one can define the non-negative tension quantities T_GK(A; B) = I(A; B) − CI_GK(A; B) and T_Wyn(A; B) = CI_Wyn(A; B) − I(A; B). These notions of tension were identified in [PP14] as special cases of a unified 3-dimensional notion of a tension region.

In [PP14], an operational meaning was attached to the tension region in terms of a communication problem, and it was also shown that a secure protocol for a function whose tension region is far from the origin will need a large number of instances of oblivious transfer. In Appendix A, we summarize some of the basic properties of the tension region, as developed in [PP14].

We lower-bound the information complexity of a function f in terms of how different the tension regions of (X; Y) and (X, Z; Y, Z) are, where Z = f(X, Y) (or rather, Pr[Z = f(X, Y)] ≥ 1/2 + ε). In particular, when the inputs (X; Y) are independent of each other (so that their tension is zero, and hence their tension region contains the origin), the information complexity region is shown to lie inside the tension region of (X, Z; Y, Z). (An information complexity region farther from the origin corresponds to a higher lower-bound on information complexity.)
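For small alphabets, CI_GK, and hence T_GK, can be computed exactly: the maximal common random variable of (A, B) is the index of the connected component of the bipartite graph on the supports of A and B, with an edge for every pair of positive probability. A minimal sketch of this computation (our own illustration, not part of the paper):

```python
from math import log2

def entropy(ps):
    return -sum(p * log2(p) for p in ps if p > 0)

def mutual_information(p):
    # p: dict {(a, b): prob}
    pa, pb = {}, {}
    for (a, b), v in p.items():
        pa[a] = pa.get(a, 0) + v
        pb[b] = pb.get(b, 0) + v
    return entropy(pa.values()) + entropy(pb.values()) - entropy(p.values())

def ci_gk(p):
    # Connected components of the bipartite support graph define the
    # maximal common random variable; CI_GK is its entropy.
    parent = {}
    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    def union(u, v):
        parent[find(u)] = find(v)
    for (a, b), v in p.items():
        if v > 0:
            union(('a', a), ('b', b))
    comp_mass = {}
    for (a, b), v in p.items():
        c = find(('a', a))
        comp_mass[c] = comp_mass.get(c, 0) + v
    return entropy(comp_mass.values())

def t_gk(p):
    return mutual_information(p) - ci_gk(p)

# A = B (a shared uniform bit): all correlation is common, so T_GK = 0.
p_equal = {(0, 0): 0.5, (1, 1): 0.5}
# Binary symmetric correlation with full support: no non-trivial common
# part, so CI_GK = 0 and T_GK = I(A;B) > 0.
p_bsc = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(t_gk(p_equal), t_gk(p_bsc))
```

The second example illustrates a well-known brittleness of CI_GK: as soon as the support graph is connected, the common part is trivial and all of I(A; B) counts as tension.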
Note that even though Z may be a single bit, the difference between the tension regions of (X; Y) and (XZ; YZ) could be quite large – as we illustrate by the connection with discrepancy. (Informally, the farther the tension region is from the origin, the higher the tension, along different dimensions.) Our contributions are in two parts:
1. We show that information complexity can be lower-bounded using tension – a fundamental quantity defined with no reference to protocols.
2. We illustrate the potential of this approach for yielding strong lower-bounds, by obtaining an improved lower-bound on information complexity in terms of discrepancy.

Below, we shall elaborate on these contributions further. We point out that our model and results are, in some ways, more general than prior work:

• In developing the connection between information complexity and tension (as well as between information complexity and communication complexity), we work with a “bigger picture” that considers 2-dimensional notions of these quantities. We remark that even if we are interested only in bounding communication complexity and information complexity (corresponding to 1-dimensional regions), using bounds in terms of the 2-dimensional region can yield potentially stronger lower-bounds.
• Our results hold for randomized functions, with asymmetric outputs.
• A minor difference is that in our communication model, we allow for the possibility that the transcript (i.e., the concatenation of all the messages sent during the protocol in either direction) may not be “parsable” into individual messages by an outsider, though each party, with its input, can parse it. (See Footnote 4.)

We propose, as a direction for further study, that various results on information complexity which led to advances in communication complexity can be rederived for tension, thereby providing alternate (and hopefully simpler) proofs of these results. Also, we leave it as an open problem to exploit the full power of the tension bounds: currently, there are few techniques to map out the full 3-dimensional tension region of a pair of random variables.
Tension, Information Complexity and Communication Complexity
The basic idea behind lower-bounding information complexity by tension is, in fact, easy to see. Consider a protocol in which, for simplicity, the two parties are given independent inputs X, Y, exchange messages to generate a transcript M, and produce a common output Z. Since X, Y were independent of each other, we know that (X, Z) and (Y, Z) should continue to be independent conditioned on the transcript, M; i.e., (X, Z) − M − (Y, Z). One can see that the information cost of this protocol, I(X; M | Y) + I(Y; M | X), can be lower-bounded by I(XZ; M | YZ) + I(YZ; M | XZ), which in turn can be lower-bounded by inf_{Q : XZ−Q−YZ} I(XZ; Q | YZ) + I(YZ; Q | XZ) (i.e., without requiring that Q is the transcript of a protocol that outputs Z, but only that XZ − Q − YZ). The latter quantity is exactly the Wyner-Tension, T_Wyn(XZ; YZ). When (X, Y) are not independent, this lower-bound changes to T_Wyn(XZ; YZ) − T_Wyn(X; Y). Jumping ahead, we mention that we can extend this basic lower-bound to a more general one, where we also consider Q such that the condition XZ − Q − YZ is replaced by I(XZ; YZ | Q) ≤ c for c ≥ 0 (this is of interest only when X, Y are correlated).

We derive our lower-bounds in terms of 2-dimensional regions, which can potentially yield stronger lower-bounds than considering the two points T_Wyn(XZ; YZ) and T_Wyn(X; Y) on the one-dimensional line. The general relation between communication complexity and information complexity, and that between information complexity and tension (Theorem 3 and Theorem 1), can be summarized as

C ⊆ I ⊆ R,

where C denotes the set of communication cost pairs (number of bits from Alice to Bob, and vice-versa) achievable by protocols computing a possibly randomized function f, I denotes the information cost pairs (information communicated by Alice to Bob about her input, and vice versa) achievable by such protocols, and R, as described below, denotes a 2-dimensional restriction of the 3-dimensional “tension region” that was introduced in [PP14].
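The chain of inequalities above can be checked numerically on a toy protocol (our own example, not from the paper): X, Y are independent uniform bits, Alice sends X, then Bob sends Z = X ∧ Y, so the transcript is M = (X, Z).

```python
from math import log2

def cond_mi(p, f_a, f_b, f_c):
    # I(A;B|C) for A=f_a(w), B=f_b(w), C=f_c(w), with w ~ p (dict w -> prob),
    # via I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C).
    def H(fs):
        marg = {}
        for w, v in p.items():
            k = tuple(f(w) for f in fs)
            marg[k] = marg.get(k, 0) + v
        return -sum(v * log2(v) for v in marg.values() if v > 0)
    return H([f_a, f_c]) + H([f_b, f_c]) - H([f_a, f_b, f_c]) - H([f_c])

# Outcomes w = (x, y); output Z = x & y; transcript M = (x, z).
p = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
X = lambda w: w[0]
Y = lambda w: w[1]
Z = lambda w: w[0] & w[1]
M = lambda w: (w[0], w[0] & w[1])
XZ = lambda w: (X(w), Z(w))
YZ = lambda w: (Y(w), Z(w))

ic = cond_mi(p, X, M, Y) + cond_mi(p, Y, M, X)      # information cost
lb = cond_mi(p, XZ, M, YZ) + cond_mi(p, YZ, M, XZ)  # intermediate lower bound
markov = cond_mi(p, XZ, YZ, M)                      # I(XZ;YZ|M), should be 0
print(ic, lb, markov)
```

Here the information cost is 1.5 bits, the intermediate quantity I(XZ; M | YZ) + I(YZ; M | XZ) is 0.5 bits, and the Markov condition (X, Z) − M − (Y, Z) holds exactly, as the argument predicts.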
Here, all three regions are defined to be “upward closed” subsets of R²≥0: i.e., if (x, y) is in the set, then so is (x′, y′) for all x′ ≥ x and y′ ≥ y.

Before fully describing R, for simplicity, consider the case of independent X, Y. In this case, R is given by

T_0(XZ; YZ) = { (r1, r2) ∈ R²≥0 : ∃ Q s.t. XZ − Q − YZ and I(XZ; Q | YZ) ≤ r1, I(YZ; Q | XZ) ≤ r2 }.

This is a convex, upward-closed region, typically bounded away from the origin. In the more general case, when X, Y are not independent, R is somewhat more complex. In particular, it is contained in the region

T_0(XZ; YZ) − T_0(X; Y) = { (r1, r2) ∈ R²≥0 : (r1, r2) + T_0(X; Y) ⊆ T_0(XZ; YZ) }.

Typically, we expect the region T_0(XZ; YZ) to be much further away from the origin than T_0(X; Y) (i.e., (XZ; YZ) has much higher tension than (X; Y)). The region T_0(XZ; YZ) − T_0(X; Y) (or rather, the lower boundary of it) captures the least amount by which T_0(X; Y) should be pushed away from the origin so that it moves completely inside T_0(XZ; YZ). The bound T_Wyn(XZ; YZ) − T_Wyn(X; Y) mentioned earlier can be obtained as

inf_{(a,b) ∈ T_0(XZ;YZ)} (a + b) − inf_{(a,b) ∈ T_0(X;Y)} (a + b) ≤ inf_{(a,b) ∈ T_0(XZ;YZ) − T_0(X;Y)} (a + b).

Here we point out that the inequality above could be strict, in which case settling for a 1-dimensional version would give a weaker bound than what is implied by the 2-dimensional version.

The full definition of R is ∩_{c ≥ 0} ( T_c(XZ; YZ) − T_c(X; Y) ), where in T_c(XZ; YZ) we do not restrict to Q such that XZ − Q − YZ; instead we require only that I(XZ; YZ | Q) ≤ c. In showing that R gives a valid outer-bound on I, we rely on a certain “monotonicity” property of the 3-dimensional tension region of the views of the parties in a protocol: the tension region can only extend closer to the origin as the protocol progresses. While quite general in its form, we leave it as an open problem to exploit the full power of this connection, since understanding the full 3-dimensional tension region is an outstanding challenge.
Information Complexity vs. Communication Complexity.
As mentioned above, the connection between information complexity and communication complexity is well-known. We extend this relation to the 2-dimensional regions C and I. Note that C corresponds to average communication-complexity. Hence C ⊆ I directly yields a lower-bound not just on worst-case communication complexity (as it is often presented in the literature), but in fact on average communication complexity as well. This allows one to translate lower-bounds on information complexity of protocols of a certain error rate to lower-bounds on average communication complexity for the same error rate.
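As a quick sanity check of IC_XY(Π) ≤ CC_XY(Π) (Lemma 1 below), consider a toy one-message protocol (our own example): X is a uniform 2-bit string, Y is its first bit, and Alice sends M = X in full. Then IC^(12)_XY(Π) = I(X; M | Y) = H(X | Y) = 1 bit, strictly below CC^(12)_XY(Π) = 2.

```python
from math import log2

def H(p):  # entropy of a dict {outcome: prob}
    return -sum(v * log2(v) for v in p.values() if v > 0)

def marginal(p, idx):
    out = {}
    for w, v in p.items():
        k = tuple(w[i] for i in idx)
        out[k] = out.get(k, 0) + v
    return out

# w = (x, y): X uniform over {0,1,2,3}; Y is the high bit of X; M = X.
p = {(x, x >> 1): 0.25 for x in range(4)}

# Since M = X, I(X; M | Y) collapses to H(X | Y) = H(X, Y) - H(Y).
ic12 = H(marginal(p, [0, 1])) - H(marginal(p, [1]))
cc12 = 2.0  # Alice always sends exactly 2 bits
print(ic12, cc12)
```

The gap (1 vs. 2 bits) comes precisely from the bit Bob already knows: the transcript carries it, but it conveys no information about X beyond Y.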
Discrepancy vs. Tension
Consider
X, Y being n-bit long strings, and Z being a single bit with Pr[Z = f(X, Y)] ≥ 1/2 + ε, where f is, say, the inner-product over GF(2). When X, Y are independent, T_Wyn(X; Y) = 0. One would wonder if adding a single bit to the random variables can change their tension by more than a constant amount. But as it turns out, the correlation between XZ, YZ as captured by T_Wyn can be Ω(n) bits! For this, we rely on the function f having an exponentially small “discrepancy,” a combinatorial measure of complexity of a function.

Indeed, in Section 5 we show that the Wyner-Tension T_Wyn(XZ; YZ), where X, Y are independent and Pr[Z = f(X, Y)] ≥ 1/2 + ε, can be lower-bounded as Ω(ε log(ε/∆)) if the discrepancy of f (w.r.t. the distribution of (X, Y)) is upper-bounded by ∆. This compares favorably with a similar bound in [BW12], of the form Ω(ε² log(ε/∆)) (though, as mentioned above, the bound in [BW12] applies even if X, Y are not independent).

A more general monotonicity property holds, allowing the parties to not just exchange messages, but also to “securely” delete parts of their views. This was shown in [PP14] for all of the tension region, including T_Wyn; a similar result appeared for T_GK and two other points in the tension region in an earlier work of Wolf and Wullschleger [WW05]. In fact, we observe that the inequality IC_µ(Π) ≤ CC(Π) [BR11] used to relate information cost and worst-case communication cost of a protocol can in fact be strengthened to IC_µ(Π) ≤ CC_µ(Π) ≤ CC(Π), for any distribution µ over the inputs. (See Lemma 1.)
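For intuition about the discrepancy quantity, one can brute-force Disc_XY(f) for the inner-product function on a tiny domain (our own sanity check, not from the paper; under the uniform distribution, Lindsey's lemma gives Disc ≤ 2^{−n/2}):

```python
from itertools import chain, combinations

def subsets(s):
    return chain.from_iterable(combinations(s, k) for k in range(len(s) + 1))

def discrepancy(n, f):
    # Disc over the uniform distribution on {0,1}^n x {0,1}^n:
    # max over rectangles X' x Y' of |Pr[r and f=0] - Pr[r and f=1]|.
    dom = range(2 ** n)
    p = 1.0 / (2 ** n) ** 2
    best = 0.0
    for xs in subsets(dom):
        for ys in subsets(dom):
            s = sum(p * (1 if f(x, y) == 0 else -1) for x in xs for y in ys)
            best = max(best, abs(s))
    return best

def ip(x, y):  # inner product over GF(2) of the bit-representations
    return bin(x & y).count("1") % 2

n = 2
d = discrepancy(n, ip)
print(d)  # at most 2 ** (-n / 2) by Lindsey's lemma
```

The exhaustive search is only feasible for very small n (the number of rectangles is doubly exponential), but it makes the definition concrete: even the best rectangle is nearly balanced between the two values of f.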
To lower-bound T_Wyn(XZ; YZ), it turns out to be enough to lower-bound I(XY; Q) for Q such that X − Q − Y and, given Q, Z is determined (i.e., H(Z | Q) = 0). The high-level intuition is to analyze the advantage Z has (i.e., Pr[Z = f(X, Y)] − 1/2) as contributed by different values of Q. For starters, suppose the input distribution is uniform and further, for each value q for Q, the conditional distribution p_{XY|Q=q} is also uniform over a rectangle. Then, for q such that this rectangle is large, its contribution to the advantage will be small, because otherwise it will result in a large discrepancy (recall that Z must take a single value conditioned on Q = q). Thus, to achieve a large advantage when the discrepancy is small, most of the mass on Q should correspond to q such that p_{XY|Q=q} is uniform over a “small” rectangle. Intuitively, this should imply a large value for I(XY; Q).

This idea runs into several complications. Mainly, p_{XY|Q=q} is guaranteed only to be a product distribution, and not necessarily uniform over its support. To tackle this, we show how to slice this distribution into several components, each of which is indeed uniform (or more generally, when XY is not uniform, each one is p_{XY|(X,Y)∈r} for some rectangle r). One could then repeat the above argument with respect to the slices. However, including the index of the slice into Q would result in a large gap between its mutual information with XY and that of the original Q. Instead, we add a single bit to Q to indicate whether the slice is a large rectangle or a small rectangle. We then argue that collecting the small rectangles into one single subset will still result in a (relatively) small subset. With this, the above outline can indeed be made to work.

We remark that the intuition that if, for most q, the support of p_{XY|Q=q} has a small mass in the original distribution p_XY, then I(XY; Q) should be large, is formalized in Lemma 2.
This may be of independent interest.

Related Work.

Many of the recent advances in the field of communication complexity [Yao79] have followed from using various notions of information complexity. Earlier notions of information complexity appeared implicitly in several works [Abl96, PRV01, SS02], and were first explicitly defined in [CSWY01]. The current notion of (internal) information complexity originated in [BYJKS04]. Information complexity has been extensively used in the recent communication complexity literature [BR11, Bra12, BW12, CKW12, KLL+12, BBCR13]. The notion was also adapted to specialized models or tasks [JKS03, JRS03, JRS05, HJMR10]. The result in [BW12] (since generalized by [KLL+12]) is the one most closely related to our discrepancy bound.

Notation.
For brevity of notation, we shall often denote the random-variables (X, Y) etc. by XY etc. Also, we shall often use a random variable to denote the probability distribution of the random variable, when the random variables that it is jointly distributed with are clear from the context: i.e., we may write Q instead of p_{Q|XY}. We write A − Q − B to indicate that I(A; B | Q) = 0.

Communication Complexity.
Let Π(X; Y) be a (randomized) 2-party protocol with inputs to the two parties being X and Y respectively. The two parties alternate sending messages to each other; Π specifies which party sends the first message, and the function mapping each party’s current view to the distribution over the next message that it sends, and a distribution over an optional output it produces (on producing an output, the party halts). The messages can be of arbitrary length, but should be self-terminating given the transcript so far and either of the two inputs. For simplicity, we do not include public coins in our model; however, with suitable modifications in the definitions, all our results would continue to hold in such a model. In particular, we note that tension between two random variables is not altered by adding a common random variable (i.e., the public random tape) to both the random variables.

We write Π(X; Y) ↦ (A; B) to denote that the random variables (A; B) (jointly distributed with (X; Y)) are the outputs produced by the two parties on running Π(X; Y). We denote by CC^(12)_XY(Π) (respectively, CC^(21)_XY(Π)) the expected number of bits sent by party 1 to party 2 (respectively, by party 2 to party 1) in the protocol Π(X; Y); the expectation is over the randomness of the protocol, as well as the input distribution p_XY.

The communication complexity – or more precisely, the “achievable communication rate region” – for computing (A; B) given (X; Y) is defined as:

C(A; B : X; Y) = { (r1, r2) ∈ R²≥0 : ∃ Π s.t. Π(X; Y) ↦ (A; B) and CC^(12)_XY(Π) ≤ r1, CC^(21)_XY(Π) ≤ r2 }.

Note that the region C(A; B : X; Y) is an upward closed region. In fact, the different regions we shall define and use are all upward closed.

A special case of interest is when A = B = f(X, Y), for a boolean function f : X × Y → {0, 1}.
In this case we shall typically require of a protocol that the two parties agree on the outcome, but we shall allow the outcome to be wrong with some probability ε (probability taken over the input distribution as well as the randomness of the protocol). We define the communication complexity region for f (for an error probability ε) to be:

C_ε(f : X; Y) = ∪_{p_{Z|XY} : SD(p_{ZXY}, p_{f(X,Y)XY}) ≤ ε} C(Z; Z : X; Y),

where SD(p_A, p_B) is the total variation distance between the distributions p_A, p_B, defined as SD(p_A, p_B) = (1/2) Σ_a |p_A(a) − p_B(a)|. Also of special interest is the (average-case) communication complexity, which considers just the total number of bits communicated, irrespective of the direction:

CC^ε_XY(f) = inf { r1 + r2 : (r1, r2) ∈ C_ε(f : X; Y) }.

Information Complexity.
The information cost of a protocol Π is defined as follows. Let Π(X; Y) ↦ (A; B) and let M denote the transcript of Π(X; Y). Then we define

IC^(12)_XY(Π) = I(X; M | Y),    IC^(21)_XY(Π) = I(Y; M | X),

and IC_XY(Π) = IC^(12)_XY(Π) + IC^(21)_XY(Π). We define the information complexity region as:

I(A; B : X; Y) = { (r1, r2) ∈ R²≥0 : ∃ Π s.t. Π(X; Y) ↦ (A; B) and IC^(12)_XY(Π) ≤ r1, IC^(21)_XY(Π) ≤ r2 }.

Of special interest is the following quantity – the information complexity of computing Z from (X; Y):

IC_XY(Z) = inf { r1 + r2 : (r1, r2) ∈ I(Z; Z : X; Y) }.

Discrepancy.
Let R = {X′ × Y′ : X′ ⊆ X, Y′ ⊆ Y}, the set of all “rectangles” in X × Y. Then, given a distribution p_XY over X × Y, and a boolean function f : X × Y → {0, 1}, we define

Disc_XY(f) = max_{r ∈ R} | Pr[(X, Y) ∈ r ∧ f(X, Y) = 0] − Pr[(X, Y) ∈ r ∧ f(X, Y) = 1] |
           = max_{X′ ⊆ X, Y′ ⊆ Y} | Σ_{(x,y) ∈ X′×Y′ : f(x,y)=0} p_XY(x, y) − Σ_{(x,y) ∈ X′×Y′ : f(x,y)=1} p_XY(x, y) |.

The traditional definition of a protocol in the communication complexity literature is slightly more restrictive: it requires that the messages are self-truncating given just the transcript so far. We note that when the two parties have correlated inputs (e.g., as part of their private inputs, they share a one-time pad which is used to mask the entire communication) this should no more be required.

2.1 Tension

The tension region of a pair of random variables was defined in [PP14] as the following upward closed region.
Definition 1.
For a pair of random variables
A, B, their tension region T(A; B) is defined as

T(A; B) = { (r1, r2, r3) : ∃ Q jointly distributed with A, B s.t. I(B; Q | A) ≤ r1, I(A; Q | B) ≤ r2, I(A; B | Q) ≤ r3 }.

As shown in [PP14], without loss of generality, we may assume a cardinality bound |Q| ≤ |A||B| + 2 on the alphabet Q in the above definition, where A and B are the alphabets of A and B, respectively. It was also shown there that T(A; B) has the interpretation as a rate-information tradeoff region for a distributed common randomness generation problem which generalizes the common randomness problem of Gács and Körner [GK73]. T(A; B) is a closed, convex region, with the following monotonicity property for randomized (public/private coins) protocols: Suppose X, Y are the inputs and A, B the outputs of the parties under a protocol. Let M denote the transcript of the protocol. Let V_A = (X, A, M) and V_B = (Y, B, M) denote the views of the parties at the end of the protocol.

Proposition 1 (Theorem 5.4 of [PP14]). T(V_A; V_B) ⊇ T(X; Y).

In the sequel we will apply certain implications of the above result. Specifically, we will be interested in the inclusion relationship of certain restrictions of the tension regions of inputs and the views. For convenience, we define for c ≥ 0 the intersection of the tension region with the plane r3 = c as T_c. More precisely,

T_c(A; B) = { (r1, r2) ∈ R²≥0 : (r1, r2, c) ∈ T(A; B) }
          = { (r1, r2) ∈ R²≥0 : ∃ p_{Q|A,B} s.t. I(B; Q | A) ≤ r1, I(A; Q | B) ≤ r2, I(A; B | Q) ≤ c }.

The case of c = 0 will be of special interest to us. Here, we will focus on the minimum r1 + r2. We define the Wyner-tension T_Wyn(A; B) of two jointly distributed random variables A, B as

T_Wyn(A; B) = inf { r1 + r2 : (r1, r2) ∈ T_0(A; B) } = inf_{p_{Q|AB} : A−Q−B} I(A; Q | B) + I(B; Q | A).

This quantity is related to Wyner’s common information CI_Wyn(A; B) of two random variables A, B [Wyn75].
CI_Wyn(A; B) = inf_{p_{Q|AB} : A−Q−B} I(A, B; Q).

It is easy to see the following [PP14]: T_Wyn(A; B) = CI_Wyn(A; B) − I(A; B). Notice that CI_Wyn(A; B) ≥ I(A; B) and T_Wyn(A; B) ≥ 0.

In this section, we lower-bound information complexity in terms of tension. As we shall work with the more general information complexity region I(A; B : X; Y), the “lower-bound” corresponds to bounding the region away from the origin. For this, we shall define a region R(A; B : X; Y) ⊆ R²≥0, which will then be used to outer-bound the region I(A; B : X; Y). We define:

R(A; B : X; Y) = ∩_{c ≥ 0} ( T_c(B, Y; A, X) − T_c(Y; X) ),

where S1 − S2 = { (a, b) ∈ R²≥0 : (a, b) + S2 ⊆ S1 }, and (a, b) + S, for a, b ∈ R and S ⊆ R², is { (x + a, y + b) : (x, y) ∈ S }. We also define

R̃(A; B : X; Y) = ( H(B|Y) − H(AB|XY), H(A|X) − H(AB|XY) ) + R(A; B : X; Y).

Note that if H(A|X) ≥ H(AB|XY) and H(B|Y) ≥ H(AB|XY), then R̃(A; B : X; Y) ⊆ R(A; B : X; Y). These conditions are satisfied if, for instance, A = B (both parties output the same value), or H(A, B | X, Y) = 0 (the output is a deterministic function of the input), or more generally if H(A | B, X, Y) = H(B | A, X, Y) = 0 (i.e., any randomness in the outputs given the inputs is common to both outputs). Even if these conditions are not satisfied, if the outputs A and B are short, then R̃(A; B : X; Y) is close to R(A; B : X; Y), and the difference between the two can be ignored.

Theorem 1. I(A; B : X; Y) ⊆ R̃(A; B : X; Y). In particular, if H(A|X) ≥ H(A, B | X, Y) and H(B|Y) ≥ H(A, B | X, Y), then I(A; B : X; Y) ⊆ R(A; B : X; Y).

Proof.
Consider any protocol Π that takes (X; Y) as input and outputs (A; B). Let U_A = (X, A), U_B = (Y, B) be the input-output of Alice and Bob, and let M be the transcript of the messages exchanged between Alice and Bob.

IC^(12)_XY(Π) = I(X; M | Y)                                        (1)
  (a) = I(X; M, B | Y) = I(X; B | Y) + I(X; M | Y, B)
      = I(X; B | Y) − I(A; M | X, Y, B) + I(X, A; M | Y, B)
      ≥ I(X; B | Y) − H(A | X, Y, B) + I(X, A; M | Y, B)
      = H(B | Y) − H(A, B | X, Y) + I(U_A; M | U_B),                (2)

where (a) follows from the Markov chain B − (Y, M) − X. Similarly,

IC^(21)_XY(Π) ≥ H(A | X) − H(A, B | X, Y) + I(U_B; M | U_A).        (3)

Then it is enough to outer bound the region containing (I(U_A; M | U_B), I(U_B; M | U_A)). Let V_A = (U_A, M), V_B = (U_B, M) be the views of Alice and Bob at the end of the protocol. By Proposition 1, T(V_B; V_A) ⊇ T(Y; X). This implies that, for each p_{Q|X,Y}, there exists a p_{Q̃|V_A,V_B} such that

I(V_A; Q̃ | V_B) ≤ I(X; Q | Y),    (4)
I(V_B; Q̃ | V_A) ≤ I(Y; Q | X),    (5)
I(V_A; V_B | Q̃) ≤ I(X; Y | Q).    (6)

But,

I(V_A; Q̃ | V_B) = I(U_A, M; Q̃ | U_B, M) = I(U_A; Q̃, M | U_B) − I(U_A; M | U_B) ≥ I(U_A; Q̃ | U_B) − I(U_A; M | U_B).

Similarly,

I(V_B; Q̃ | V_A) ≥ I(U_B; Q̃ | U_A) − I(U_B; M | U_A),
I(V_B; V_A | Q̃) ≥ I(U_A; U_B | Q̃).

Hence, for each Q, there exists Q̃ such that

I(U_A; M | U_B) + I(X; Q | Y) ≥ I(U_A; Q̃ | U_B),
I(U_B; M | U_A) + I(Y; Q | X) ≥ I(U_B; Q̃ | U_A),
I(X; Y | Q) ≥ I(U_A; U_B | Q̃).

Hence, for every c ≥ 0, we have (I(U_A; M | U_B), I(U_B; M | U_A)) + T_c(Y; X) ⊆ T_c(U_B; U_A). In other words, (I(U_A; M | U_B), I(U_B; M | U_A)) must lie in the set T_c(U_B; U_A) − T_c(Y; X). Combined with (2) and (3), we get that

(IC^(12)_XY(Π), IC^(21)_XY(Π)) ∈ ( H(B|Y) − H(AB|XY), H(A|X) − H(AB|XY) ) + ∩_{c ≥ 0} ( T_c(U_B; U_A) − T_c(Y; X) ).

Since this holds for all Π such that Π(X; Y) ↦ (A; B), we get

I(A; B : X; Y) ⊆ ( H(B|Y) − H(AB|XY), H(A|X) − H(AB|XY) ) + R(A; B : X; Y) = R̃(A; B : X; Y).
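As an aside, the identity T_Wyn(A; B) = CI_Wyn(A; B) − I(A; B) noted earlier holds at the level of each admissible Q: whenever A − Q − B, we have I(A; Q | B) + I(B; Q | A) = I(A, B; Q) − I(A; B). A small numerical check (our own sketch, with A and B generated conditionally independently given a ternary Q, so that A − Q − B holds by construction):

```python
from math import log2

def H(p):
    return -sum(v * log2(v) for v in p.values() if v > 0)

def marginal(p, idx):
    out = {}
    for w, v in p.items():
        k = tuple(w[i] for i in idx)
        out[k] = out.get(k, 0) + v
    return out

def cond_mi(p, ia, ib, ic):
    # I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); coordinates as index lists
    return (H(marginal(p, ia + ic)) + H(marginal(p, ib + ic))
            - H(marginal(p, ia + ib + ic)) - H(marginal(p, ic)))

# Coordinates: w = (a, q, b). Any p_Q and kernels p_{A|Q}, p_{B|Q} work.
pq = {0: 0.2, 1: 0.5, 2: 0.3}
pa_q = {0: (0.9, 0.1), 1: (0.5, 0.5), 2: (0.2, 0.8)}
pb_q = {0: (0.7, 0.3), 1: (0.4, 0.6), 2: (0.1, 0.9)}
p = {(a, q, b): pq[q] * pa_q[q][a] * pb_q[q][b]
     for q in pq for a in (0, 1) for b in (0, 1)}

lhs = cond_mi(p, [0], [1], [2]) + cond_mi(p, [2], [1], [0])  # I(A;Q|B)+I(B;Q|A)
rhs = cond_mi(p, [0, 2], [1], []) - cond_mi(p, [0], [2], []) # I(AB;Q)-I(A;B)
print(lhs, rhs)
```

This is the same manipulation used in the proof of Theorem 5 below (with Z added in): the per-Q objective of T_Wyn equals the per-Q objective of CI_Wyn minus I(A; B), so taking infima over the same constraint set gives the identity.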
For all
X, Y, Z , IC XY ( Z ) ≥ T Wyn ( XZ ; Y Z ) − T Wyn ( X ; Y ) . In particular, if X and Y are independent of each other, IC XY ( Z ) ≥ T Wyn ( XZ ; Y Z ) .Proof. Firstly, note that the condition in Theorem 1 holds when A = B = Z , since H ( Z | X ) ≤ H ( Z | XY ) and H ( Z | Y ) ≤ H ( Z | XY ) . Thus, I ( Z ; Z : X ; Y ) ⊆ R ( Z ; Z : X ; Y ) ⊆ T ( Y Z ; XZ ) − T ( Y ; X ) . Then, IC XY ( Z ) = inf ( a,b ) ∈ I ( Z ; Z : X ; Y ) ( a + b ) ≥ inf ( a,b ) ∈ T ( Y Z ; XZ ) − T ( Y ; X ) ( a + b ) . Now, ∀ ( a, b ) ∈ ( S − S ) ,we have S ⊇ ( a, b ) + S ; hence, inf ( r ,r ) ∈ S ( r + r ) ≤ inf ( a,b ) ∈ S − S ( a + b ) + inf ( r ,r ) ∈ S ( r + r ) . Recall that inf ( r ,r ) ∈ T U ; V ( r + r ) = T Wyn ( U ; V ) . Thus, IC XY ( Z ) ≥ T Wyn ( Y Z ; XZ ) − T Wyn ( Y ; X ) . The statement in the theorem follows from the symmetry of T Wyn . Below we show that the communication complexity region is outer-bounded by the information complexityregion. We start with Lemma 1 below, which relates the communication cost pair of a protocol to its informationcost pair. A simplified version of this result that has been used extensively, namely, IC XY (Π) ≤ CC (Π) ,appears in [BR11]. Note that from Lemma 1 it follows that, in fact, IC XY (Π) ≤ CC XY (Π) (and clearly, CC XY (Π) ≤ CC (Π) ). That is, the information-complexity lower-bound applies not just to the worst casecommunication complexity, but also to the average case communication complexity. Lemma 1.
For any protocol Π and input distribution (X, Y), the following hold: IC^(12)_XY(Π) ≤ CC^(12)_XY(Π), IC^(21)_XY(Π) ≤ CC^(21)_XY(Π). In particular, IC_XY(Π) ≤ CC_XY(Π).

Proof. We shall show that IC^(12)_XY(Π) ≤ CC^(12)_XY(Π); the second inequality follows similarly, and the third is obtained by adding the first two inequalities. Below, the random variable M denotes the transcript of the protocol Π with input (X; Y), M_{i+1} denotes the (i+1)-th bit of M, and M^i denotes the first i bits of M. For notational convenience, we define M_i to be a fixed symbol (say, 0) if i is greater than the length of M. Let M be the set of all complete transcripts. Also, for m ∈ M, we write |m|_{12} to denote the (expected) number of bits in m that are sent by party 1 to party 2 (expectation over either input), and similarly |m|_{21} to denote the bits in the other direction, so that |m| = |m|_{12} + |m|_{21}.

IC^(12)_XY(Π) = I(M; X | Y)
  = Σ_{i=0}^{∞} I(M_{i+1}; X | Y, M^i)
  = Σ_{i=0}^{∞} Σ_{m ∈ {0,1}^i} Pr[M^i = m] · I(M_{i+1}; X | Y, M^i = m)
  = Σ_{i=0}^{∞} Σ_{m ∈ {0,1}^i} Σ_{m̂ ∈ M : m̂^i = m} Pr[M = m̂] · I(M_{i+1}; X | Y, M^i = m)
  = Σ_{i=0}^{∞} Σ_{m̂ ∈ M} Pr[M = m̂] · I(M_{i+1}; X | Y, M^i = m̂^i)
  = Σ_{m̂ ∈ M} Pr[M = m̂] · Σ_{i=0}^{|m̂|−1} I(M_{i+1}; X | Y, M^i = m̂^i)
  (a) ≤ Σ_{m̂ ∈ M} Pr[M = m̂] · |m̂|_{12} = CC^(12)_XY(Π),

where inequality (a) follows from the fact that, for each value of y, I(M_{i+1}; X | Y = y, M^i = m̂^i) = 0 if, after m̂^i (and given Y = y), the next message is sent by Bob, and otherwise I(M_{i+1}; X | Y = y, M^i = m̂^i) ≤ H(M_{i+1}) ≤ 1.

The following theorem is an immediate consequence of Lemma 1.

Theorem 3. C(A; B : X; Y) ⊆ I(A; B : X; Y).

Proof. Consider any protocol Π that takes (X; Y) as input and outputs (A; B).
By Lemma 1, IC^(12)_XY(Π) ≤ CC^(12)_XY(Π) and IC^(21)_XY(Π) ≤ CC^(21)_XY(Π). Thus, by the definition of I(A; B : X; Y), (CC^(12)_XY(Π), CC^(21)_XY(Π)) ∈ I(A; B : X; Y). Since this holds for all Π such that Π(X; Y) ↦ (A; B), and I(A; B : X; Y) is an upward closed region, the theorem follows.

Following the definitions, the above theorem yields the following lower-bound:

CC^ε_XY(f) ≥ inf_{p_{Z|XY} : SD(p_{ZXY}, p_{f(X,Y)XY}) ≤ ε} IC_XY(Z).

Combining this with Corollary 2, we obtain the following lower-bound on (average-case) communication complexity.

Since we do not require the transcripts to be parsable on their own without an input (see Footnote 4), strictly speaking, the set of complete transcripts is not well-defined. However, M can be defined more loosely as, for instance, the set of all strings of length d, where d is an upper bound on the worst-case communication cost of the protocol, and the arguments in the proof continue to hold. In fact, even if this cost is unbounded, as long as the average cost CC^(12)_XY(Π) is bounded (otherwise the inequality is trivial to see), it is possible to extend the proof by considering d → ∞.

Corollary 4. For all ε ≥ 0,

CC^ε_XY(f) ≥ inf_{p_{Z|XY} : SD(p_{ZXY}, p_{f(X,Y)XY}) ≤ ε} T_Wyn(XZ; YZ) − T_Wyn(X; Y).

In particular, if (X, Y) are independent of each other,

CC^ε_XY(f) ≥ inf_{p_{Z|XY} : SD(p_{ZXY}, p_{f(X,Y)XY}) ≤ ε} T_Wyn(XZ; YZ).

Theorem 5.
Suppose (X, Y) are independent random variables over 𝒳 × 𝒴, and f : 𝒳 × 𝒴 → {0, 1} is a function with Disc_XY(f) ≤ ∆. Also, suppose Z is a binary random variable jointly distributed with (X, Y) such that Pr[Z ≠ f(X, Y)] ≤ (1 − ε)/2. Then

T_Wyn(XZ; YZ) ≥ (ε(1 − ε)/2) log(ε²/∆) − 4.

Proof.
We seek to lower-bound the tension T_Wyn(XZ; YZ) = inf_{Q : XZ−Q−YZ} I(XZ; Q | YZ) + I(YZ; Q | XZ). Consider a random variable Q over an alphabet 𝒬, jointly distributed with (X, Y), such that XZ−Q−YZ. Firstly, note that this implies H(Z | Q) = 0 and I(X; Y | Q) = 0 (since both these quantities are upper-bounded by I(XZ; YZ | Q) = 0). To lower-bound I(XZ; Q | YZ) + I(YZ; Q | XZ), it is enough to lower-bound I(XY; Q), as shown below:

I(XZ; Q | YZ) + I(YZ; Q | XZ) = I(X; Q | YZ) + I(Y; Q | XZ)
  = I(XZ; Q | Y) − I(Z; Q | Y) + I(YZ; Q | X) − I(Z; Q | X)
  ≥ I(X; Q | Y) + I(Y; Q | X) − 2
  = (I(X; Q | Y) + I(Y; Q | X) + I(X; Y)) − I(X; Y) − 2
  = (I(XY; Q) + I(X; Y | Q)) − I(X; Y) − 2
  = I(XY; Q) − I(X; Y) − 2,

where the inequality uses I(XZ; Q | Y) ≥ I(X; Q | Y), I(YZ; Q | X) ≥ I(Y; Q | X), and I(Z; Q | Y), I(Z; Q | X) ≤ H(Z) ≤ 1, and where in the last step we used the fact that I(X; Y | Q) = 0. Since we are given that X and Y are independent, we have I(XZ; Q | YZ) + I(YZ; Q | XZ) ≥ I(XY; Q) − 2.

For all q ∈ 𝒬, let D(q) = |Pr[f(X, Y) = 0 | Q = q] − Pr[f(X, Y) = 1 | Q = q]|. Then

ε ≤ Pr[Z = f(X, Y)] − Pr[Z ≠ f(X, Y)]
  = Σ_{q ∈ 𝒬} Pr[Q = q] (Pr[Z = f(X, Y) | Q = q] − Pr[Z ≠ f(X, Y) | Q = q])
  ≤ Σ_{q ∈ 𝒬} Pr[Q = q] D(q),

where in the last step we used the fact that H(Z | Q) = 0.

We shall define an auxiliary random variable R over rectangles (i.e., with alphabet 𝓡 = {𝒳′ × 𝒴′ : 𝒳′ ⊆ 𝒳, 𝒴′ ⊆ 𝒴}), jointly distributed with (X, Y, Q), satisfying the following conditions for each q ∈ 𝒬. Below, let 𝓡₀ ⊆ 𝓡 denote the set of "small" rectangles: i.e., 𝓡₀ = {r ∈ 𝓡 : Pr[(X, Y) ∈ r] < α}, where α is a parameter to be set later. Also, for q ∈ 𝒬, let L_q ⊆ 𝒳 × 𝒴 denote the set of all (x, y) which lie in the small rectangles that occur with q; i.e., L_q = ∪_{r ∈ 𝓡₀ : Pr[Q = q, R = r] > 0} r.

Claim 1.
There exists a random variable R with alphabet 𝓡, jointly distributed with (X, Y, Q), such that for each q ∈ 𝒬 the following hold.

• For every r ∈ 𝓡 such that Pr[Q = q, R = r] > 0, the distribution p_{XY | Q=q, R=r} is the same as p_{XY | (X,Y) ∈ r} (i.e., p_XY restricted to the rectangle r).

• Pr[(X, Y) ∈ L_q] ≤ 2√α.

We prove this claim in Appendix B.

Let R̂ be a boolean random variable such that R̂ = 0 iff R ∈ 𝓡₀, and R̂ = 1 otherwise. Let Q′ = (Q, R̂). Note that I(XY; Q) ≥ I(XY; Q′) − 1; so it is sufficient to lower-bound I(XY; Q′).

First, we lower-bound Pr[R̂ = 0], relying on the upper bound on discrepancy. Let D(q, r) = |Pr[f(X, Y) = 0 | Q = q, R = r] − Pr[f(X, Y) = 1 | Q = q, R = r]|. Then D(q) ≤ Σ_r Pr[R = r | Q = q] D(q, r). Further,

Pr[(X, Y) ∈ r] · D(q, r)
  = Pr[(X, Y) ∈ r] · |Pr[f(X, Y) = 0 | (X, Y) ∈ r] − Pr[f(X, Y) = 1 | (X, Y) ∈ r]|   (since p_{XY | Q=q, R=r} ≡ p_{XY | (X,Y) ∈ r})
  = |Pr[f(X, Y) = 0 ∧ (X, Y) ∈ r] − Pr[f(X, Y) = 1 ∧ (X, Y) ∈ r]|
  ≤ Disc_XY(f) ≤ ∆.

Then, since Pr[(X, Y) ∈ r] ≥ α for r ∉ 𝓡₀, we conclude that D(q, r) ≤ ∆/α for r ∉ 𝓡₀. Now,

ε ≤ Σ_{q ∈ 𝒬} Pr[Q = q] D(q)
  ≤ Σ_q Σ_r Pr[Q = q, R = r] D(q, r)
  ≤ Σ_{q, r ∈ 𝓡₀} Pr[Q = q, R = r] + Σ_{q, r ∉ 𝓡₀} Pr[Q = q, R = r] D(q, r)
  ≤ Σ_{q, r ∈ 𝓡₀} Pr[Q = q, R = r] + (∆/α) Σ_{q, r ∉ 𝓡₀} Pr[Q = q, R = r]
  = Pr[R̂ = 0] + (∆/α)(1 − Pr[R̂ = 0]).

So, Pr[R̂ = 0] ≥ (ε − ∆/α)/(1 − ∆/α). Finally, we use the following lemma, proven in Appendix B (with S = (X, Y), T = Q′ and T₀ = 𝒬 × {0}), to obtain our lower bound on I(XY; Q′).

Lemma 2.
Let S, T be jointly distributed random variables over 𝒮 × 𝒯, and let T₀ ⊆ 𝒯 be such that Pr[T ∈ T₀] ≥ ε and, for all t ∈ T₀, Pr[S ∈ S_t] ≤ δ, where S_t = {s ∈ 𝒮 : Pr[S = s | T = t] > 0}. Then, I(S; T) ≥ ε log(1/δ).

We apply this lemma with δ = 2√α and with (ε − ∆/α)/(1 − ∆/α) playing the role of ε in the lemma. This yields

I(XY; Q′) ≥ ((ε − ∆/α)/(1 − ∆/α)) · ((1/2) log(1/α) − 1).

As described above, this bound on I(XY; Q′) yields the following bound on tension:

T_Wyn(XZ; YZ) ≥ ((ε − ∆/α)/(1 − ∆/α)) · ((1/2) log(1/α) − 1) − 3.   (7)

To complete the proof, we set α = ∆/ε², so that ∆/α = ε²; since ε < 1, we have (ε − ε²)/(1 − ε²) ≥ ε(1 − ε), and substituting into (7) gives the bound in the theorem.

Remark:
Often ∆ is a quantity that vanishes as a size parameter of the inputs grows (e.g., when f is the inner-product function). When ε · log(ε/∆) = ω(1), one can obtain a tighter bound from the above proof by setting α = (∆/ε)^{1−β} for a small enough β > 0. This gives T_Wyn(XZ; YZ) ≥ (ε/2) · log(ε/∆) · (1 − o(1)).

Acknowledgments
We gratefully acknowledge Mark Braverman, Prahladh Harsha and Rahul Jain for helpful discussions and pointers.

References

[Abl96] Farid M. Ablayev. Lower bounds for one-way probabilistic communication complexity and their application to space complexity. Theor. Comput. Sci., 157(2):139–159, 1996.
[AK74] Rudolf Ahlswede and János Körner. On common information and related characteristics of correlated information sources, 1974.
[BBCR13] Boaz Barak, Mark Braverman, Xi Chen, and Anup Rao. How to compress interactive communication. SIAM J. Comput., 42(3):1327–1363, 2013.
[BJLP13] Gábor Braun, Rahul Jain, Troy Lee, and Sebastian Pokutta. Information-theoretic approximations of the nonnegative rank. Electronic Colloquium on Computational Complexity (ECCC), 20:158, 2013.
[BP13] Gábor Braun and Sebastian Pokutta. Common information and unique disjointness. In FOCS, pages 688–697, 2013.
[BR11] Mark Braverman and Anup Rao. Information equals amortized communication. In FOCS, pages 748–757, 2011.
[Bra12] Mark Braverman. Interactive information complexity. In STOC, pages 505–524, 2012.
[BW12] Mark Braverman and Omri Weinstein. A discrepancy lower bound for information complexity. In APPROX-RANDOM, pages 459–470, 2012.
[BYJKS04] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci., 68(4):702–732, 2004.
[CK81] Imre Csiszár and János Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Akadémiai Kiadó, Budapest, 1981.
[CKW12] Amit Chakrabarti, Ranganath Kondapally, and Zhenghui Wang. Information complexity versus corruption and applications to orthogonality and gap-hamming. In APPROX-RANDOM, pages 483–494, 2012.
[CSWY01] Amit Chakrabarti, Yaoyun Shi, Anthony Wirth, and Andrew Chi-Chih Yao. Informational complexity and the direct sum problem for simultaneous message complexity. In FOCS, pages 270–278, 2001.
[DPP14] Deepesh Data, Manoj M. Prabhakaran, and Vinod M. Prabhakaran. On the communication complexity of secure computation. In Advances in Cryptology – CRYPTO 2014, Part II, pages 199–216, 2014.
[GK73] Peter Gács and János Körner. Common information is far less than mutual information. Problems of Control and Information Theory, 2(2):149–162, 1973.
[HJMR10] Prahladh Harsha, Rahul Jain, David McAllester, and Jaikumar Radhakrishnan. The communication complexity of correlation. IEEE Transactions on Information Theory, 56(1):438–449, 2010.
[JKS03] T. S. Jayram, Ravi Kumar, and D. Sivakumar. Two applications of information complexity. In STOC, pages 673–682, 2003.
[JRS03] Rahul Jain, Jaikumar Radhakrishnan, and Pranab Sen. A direct sum theorem in communication complexity via message compression. In ICALP, pages 300–315, 2003.
[JRS05] Rahul Jain, Jaikumar Radhakrishnan, and Pranab Sen. Prior entanglement, message compression and privacy in quantum communication. In IEEE Conference on Computational Complexity, pages 285–296, 2005.
[KLL+12] Iordanis Kerenidis, Sophie Laplante, Virginie Lerays, Jérémie Roland, and David Xiao. Lower bounds on information complexity via zero-communication protocols and applications. In FOCS, pages 500–509, 2012.
[PP12] Manoj Prabhakaran and Vinod Prabhakaran. On secure multiparty sampling for more than two parties. In Proceedings of the 2012 IEEE International Information Theory Workshop (ITW 2012), 2012.
[PP14] Vinod M. Prabhakaran and Manoj M. Prabhakaran. Assisted common information with an application to secure two-party sampling. IEEE Transactions on Information Theory, 60(6):3413–3434, 2014.
[PRV01] Stephen J. Ponzio, Jaikumar Radhakrishnan, and Srinivasan Venkatesh. The communication complexity of pointer chasing. Journal of Computer and System Sciences, 62(2):323–355, 2001.
[SS02] Michael E. Saks and Xiaodong Sun. Space lower bounds for distance approximation in the data stream model. In STOC, pages 360–369, 2002.
[WW05] Stefan Wolf and Jürg Wullschleger. New monotones and lower bounds in unconditional two-party computation. In CRYPTO, pages 467–477, 2005.
[Wyn75] Aaron D. Wyner. The common information of two dependent random variables. IEEE Transactions on Information Theory, 21(2):163–179, 1975.
[Yao79] Andrew Chi-Chih Yao. Some complexity questions related to distributive computing (preliminary report). In STOC, pages 209–213, 1979.
A On the Nature of the Tension Region
In this appendix we present a gentle introduction to the notion of the tension region, as developed in [PP14]. We refer the interested reader to [PP14] for more details.

Consider the random variables X = (X′, Q) and Y = (Y′, Q) where X′, Y′, Q are independent. In this case, it is natural to consider Q as the common random variable of X and Y, and H(Q) as a natural measure of "common information." Q is determined both by X and by Y individually. Moreover, conditioned on Q, X and Y are independent, i.e., X−Q−Y is a Markov chain. One could extend this to arbitrary X, Y in a couple of natural ways. The approach of Gács and Körner [GK73] is to find the "largest" random variable Q (largeness being measured in terms of entropy) such that it is determined by X alone as well as by Y alone (with probability 1):

CI_GK(X; Y) = max_{p_{Q|XY} : H(Q|X) = H(Q|Y) = 0} H(Q) = I(X; Y) − min_{p_{Q|XY} : H(Q|X) = H(Q|Y) = 0} I(X; Y | Q).

CI_GK(X; Y) ≤ I(X; Y) and, in general, this inequality may be strict; i.e., common information, in general, does not account for all the dependence between X and Y.

Wyner gave a different generalization [Wyn75], defining common information in terms of the "smallest" random variable Q (smallness being measured in terms of I(XY; Q)) such that X and Y are independent conditioned on Q:

CI_Wyn(X; Y) = min_{p_{Q|XY} : X−Q−Y} I(XY; Q) = I(X; Y) + min_{p_{Q|XY} : X−Q−Y} (I(Y; Q | X) + I(X; Q | Y)).

Now, CI_Wyn(X; Y) ≥ I(X; Y). When X, Y are of the form X = (X′, Q) and Y = (Y′, Q), where X′, Y′, Q are independent, there is indeed a unique interpretation of common information (as then CI_GK(X; Y) = CI_Wyn(X; Y) = H(Q)). Between the extremes represented by these two measures, there are several ways in which one could define a random variable to capture the dependence between X and Y.

Definition 2.
For a pair of correlated random variables (X, Y), and p_{Q|XY}, we say Q perfectly resolves (X, Y) if I(X; Y | Q) = 0 and H(Q | X) = H(Q | Y) = 0. We say (X, Y) is perfectly resolvable if there exists p_{Q|XY} such that Q perfectly resolves (X, Y).

If (X, Y) is perfectly resolvable, then CI_GK(X; Y) = I(X; Y) = CI_Wyn(X; Y) represents the entire mutual information between them. The tension region T(X; Y) can be thought of as measuring the extent to which a pair of random variables (X, Y) is not resolvable.

Figure 1: A Venn diagram representation of the three coordinates (I(Y; Q | X), I(X; Q | Y), I(X; Y | Q)) in the definition of T(X; Y). Figure taken from [PP14].

Recall the definition of the tension region T(A; B) of a pair of random variables A, B:

T(A; B) = {(r₁, r₂, r₃) : ∃ Q jointly distributed with A, B s.t. I(B; Q | A) ≤ r₁, I(A; Q | B) ≤ r₂, I(A; B | Q) ≤ r₃}.

It follows from Fenchel–Eggleston's strengthening of Carathéodory's theorem [CK81, pg. 310] that we can restrict ourselves to p_{Q|XY} with alphabet 𝒬 such that |𝒬| ≤ |𝒳||𝒴| + 2.

It can be shown that T(X; Y) includes the origin if and only if the pair (X, Y) is perfectly resolvable. When this is not the case, it is important to consider all three coordinates together to identify the unresolvable nature of a pair (X, Y), because T(X; Y) does intersect each of the three axes; in other words, any two of the coordinates can be made simultaneously 0 by choosing an appropriate Q.

Below we summarize several useful properties of T(X; Y). For interpretations of T(X; Y) in terms of certain information-theoretic problems, we refer the reader to [PP14].

Figure 2: A schematic representation of the region T(X; Y). T(X; Y) is an unbounded, convex region, bounded away from the origin (unless (X, Y) is perfectly resolvable). The figure relates two points on the boundary of T(X; Y) to the quantities CI_GK(X; Y) and CI_Wyn(X; Y). (The dotted line is at 45° to the axes.) Figure taken from [PP14].

A.1 Some Properties of Tension
Monotonicity of T(X; Y). Wolf and Wullschleger [WW05] showed that the three axis intercepts have a certain "monotonicity" property (they can only decrease as X, Y evolve as the views of two parties in a protocol). In fact, this monotonicity is a consequence of the monotonicity of the entire region T(X; Y) stated in Proposition 1.

Tensorization of T(X; Y). If (X₁, Y₁) is independent of (X₂, Y₂), then T((X₁X₂); (Y₁Y₂)) = T(X₁; Y₁) + T(X₂; Y₂).

Convexity, closedness, and continuity of T(X; Y). Firstly, the region of tension is closed and convex. Secondly, the region of tension is continuous in the sense that when the joint p.m.f. p_{X,Y} is close to the joint p.m.f. p_{X′,Y′}, the tension regions T(X; Y) and T(X′; Y′) are also close. Specifically, if SD(XY, X′Y′) ≤ ε, then T(X; Y) ⊆ T(X′; Y′) − δ(ε), where δ(ε) = 2H(ε) + ε log max{|𝒳|, |𝒴|}.

B Proof of Lemma 2 and Claim 1.
To complete the proof of Theorem 5 we need to prove Lemma 2 and Claim 1. We do this below.
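Before turning to the formal proofs, the statement of Lemma 2 can be sanity-checked numerically. The sketch below is illustrative only: the particular joint distribution (T uniform over three values, S uniform over a small support given T) is an assumption chosen for the example, not taken from the text.

```python
from math import log2

# Toy joint pmf: T uniform over {0, 1, 2}; given T = t, S is uniform
# over a small support S_t (an assumed example, chosen for illustration).
supports = {0: [0, 1], 1: [2, 3], 2: [4, 5, 6, 7]}
p = {(s, t): (1 / 3) * (1 / len(supports[t]))
     for t in supports for s in supports[t]}

def mutual_information(joint):
    """I(S;T) in bits, from a joint pmf given as {(s, t): prob}."""
    ps, pt = {}, {}
    for (s, t), pr in joint.items():
        ps[s] = ps.get(s, 0.0) + pr
        pt[t] = pt.get(t, 0.0) + pr
    return sum(pr * log2(pr / (ps[s] * pt[t]))
               for (s, t), pr in joint.items() if pr > 0)

# Data for Lemma 2: take T0 = {0, 1}. Each support S_t with t in T0
# carries marginal mass Pr[S in S_t] = 1/3, so delta = 1/3; and
# eps = Pr[T in T0] = 2/3.
T0 = {0, 1}
ps = {}
for (s, t), pr in p.items():
    ps[s] = ps.get(s, 0.0) + pr
delta = max(sum(ps[s] for s in supports[t]) for t in T0)
eps = sum(pr for (s, t), pr in p.items() if t in T0)

# Lemma 2 asserts I(S;T) >= eps * log(1/delta).
assert mutual_information(p) >= eps * log2(1 / delta) - 1e-9
```

Here I(S; T) works out to exactly log₂ 3 ≈ 1.585, while the lemma's bound is (2/3)·log₂ 3 ≈ 1.057, so the inequality holds with room to spare.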
Proof of Lemma 2.
We have

I(S; T) = Σ_{(s,t) ∈ 𝒮×𝒯} p_{S,T}(s, t) log( p_{S,T}(s, t) / (p_S(s) p_T(t)) )
  = Σ_{t ∈ 𝒯} p_T(t) Σ_{s ∈ S_t} p_{S|T}(s | t) log( p_{S|T}(s | t) / p_S(s) )
  = Σ_{t ∈ T₀} p_T(t) Σ_{s ∈ S_t} p_{S|T}(s | t) log( p_{S|T}(s | t) / p_S(s) ) + Σ_{t ∈ 𝒯−T₀} p_T(t) Σ_{s ∈ S_t} p_{S|T}(s | t) log( p_{S|T}(s | t) / p_S(s) ).

Notice that, for each t,

Σ_{s ∈ S_t} p_{S|T}(s | t) log( p_{S|T}(s | t) / p_S(s) ) = D( p_{S|T=t} ‖ p_S ) ≥ 0.

Hence,

I(S; T) ≥ Σ_{t ∈ T₀} p_T(t) Σ_{s ∈ S_t} p_{S|T}(s | t) log( p_{S|T}(s | t) / p_S(s) ).

For each t ∈ T₀, let p_t = Pr[S ∈ S_t] = Σ_{s ∈ S_t} p_S(s), and let us define over S_t the probability mass function p^(t)(s) = p_S(s)/p_t, s ∈ S_t. Note that p_t ≤ δ. Then, for t ∈ T₀,

Σ_{s ∈ S_t} p_{S|T}(s | t) log( p_{S|T}(s | t) / p_S(s) )
  = Σ_{s ∈ S_t} p_{S|T}(s | t) log( (p_{S|T}(s | t) / (p_S(s)/p_t)) · (1/p_t) )
  = D( p_{S|T=t} ‖ p^(t) ) + log(1/p_t)
  ≥ log(1/δ).

Substituting this back,

I(S; T) ≥ Σ_{t ∈ T₀} p_T(t) log(1/δ) ≥ ε log(1/δ).

Proof of Claim 1.
It remains to describe the distribution p_{R|XYQ} so that the conditions listed in Claim 1 hold. For r = 𝒳_r × 𝒴_r ∈ 𝓡, we let

σ_{q,r} = min_{x ∈ 𝒳_r} Pr[X = x, Q = q]/(Pr[X = x] Pr[Q = q]) − max_{x′ ∉ 𝒳_r} Pr[X = x′, Q = q]/(Pr[X = x′] Pr[Q = q]),
τ_{q,r} = min_{y ∈ 𝒴_r} Pr[Y = y, Q = q]/(Pr[Y = y] Pr[Q = q]) − max_{y′ ∉ 𝒴_r} Pr[Y = y′, Q = q]/(Pr[Y = y′] Pr[Q = q]).

Above, in defining max_{x′ ∉ 𝒳_r}, if no such x′ exists – i.e., 𝒳_r = 𝒳 – we take the maximum to be 0 (and similarly for max_{y′ ∉ 𝒴_r}). Now we define p_{R|XYQ} as follows:

Pr[R = r | X = x, Y = y, Q = q] = σ_{q,r} · τ_{q,r} · Pr[X = x, Y = y]/Pr[X = x, Y = y | Q = q]  if σ_{q,r} > 0, τ_{q,r} > 0 and (x, y) ∈ r, and 0 otherwise.

An alternate way to describe the mass assigned to r is as follows. Let 𝒳_q × 𝒴_q be the support of p_{XY|Q=q}. Let 𝒳_q = {x₁, ..., x_M}, ordered such that Pr[X = x_i, Q = q]/(Pr[X = x_i] Pr[Q = q]) ≥ Pr[X = x_{i+1}, Q = q]/(Pr[X = x_{i+1}] Pr[Q = q]) for all i ∈ [1, M−1]. For notational convenience, we also define a dummy x_{M+1} with Pr[X = x_{M+1}, Q = q]/(Pr[X = x_{M+1}] Pr[Q = q]) = 0. Define y₁, ..., y_N, y_{N+1} similarly, where N = |𝒴_q|. Then, the only rectangles r for which Pr[R = r | Q = q] can be positive are of the form r_{ij} = 𝒳_i × 𝒴_j for (i, j) ∈ [M] × [N], where 𝒳_i = {x₁, ..., x_i}, 𝒴_j = {y₁, ..., y_j}, Pr[X = x_i, Q = q]/(Pr[X = x_i] Pr[Q = q]) > Pr[X = x_{i+1}, Q = q]/(Pr[X = x_{i+1}] Pr[Q = q]), and Pr[Y = y_j, Q = q]/(Pr[Y = y_j] Pr[Q = q]) > Pr[Y = y_{j+1}, Q = q]/(Pr[Y = y_{j+1}] Pr[Q = q]).

First, we verify that p_{R|Q=q, X=x, Y=y} is indeed a valid probability distribution.
Σ_{r ∈ 𝓡} Pr[R = r | Q = q, X = x_{i*}, Y = y_{j*}]
  = Σ_{r : (x_{i*}, y_{j*}) ∈ r} σ_{q,r} · τ_{q,r} · Pr[X = x_{i*}, Y = y_{j*}]/Pr[X = x_{i*}, Y = y_{j*} | Q = q]
  = (Pr[X = x_{i*}, Y = y_{j*}]/Pr[X = x_{i*}, Y = y_{j*} | Q = q]) · Σ_{i = i*}^M Σ_{j = j*}^N σ_{q, r_{ij}} · τ_{q, r_{ij}}
  = (Pr[X = x_{i*}, Y = y_{j*}]/Pr[X = x_{i*}, Y = y_{j*} | Q = q]) · Σ_{i = i*}^M ( Pr[X = x_i, Q = q]/(Pr[X = x_i] Pr[Q = q]) − Pr[X = x_{i+1}, Q = q]/(Pr[X = x_{i+1}] Pr[Q = q]) ) · Σ_{j = j*}^N ( Pr[Y = y_j, Q = q]/(Pr[Y = y_j] Pr[Q = q]) − Pr[Y = y_{j+1}, Q = q]/(Pr[Y = y_{j+1}] Pr[Q = q]) )
  = (Pr[X = x_{i*}, Y = y_{j*}]/Pr[X = x_{i*}, Y = y_{j*} | Q = q]) · (Pr[X = x_{i*}, Q = q]/(Pr[X = x_{i*}] Pr[Q = q])) · (Pr[Y = y_{j*}, Q = q]/(Pr[Y = y_{j*}] Pr[Q = q]))
  = 1,

where the two inner sums telescope, and in the last step we used the facts that X, Y are independent and also that they are conditionally independent conditioned on Q.

Next, we verify that p_{XY | Q=q, R=r} ≡ p_{XY | (X,Y) ∈ r}. Firstly, if (x, y) ∉ r, then Pr[R = r | X = x, Y = y, Q = q] = 0, and hence Pr[X = x, Y = y | Q = q, R = r] = 0 (and also Pr[X = x, Y = y | (X, Y) ∈ r] = 0). Now, suppose (x, y) ∈ r. Then,

Pr[X = x, Y = y | Q = q, R = r] = Pr[R = r | X = x, Y = y, Q = q] · Pr[X = x, Y = y | Q = q]/Pr[R = r | Q = q]
  = σ_{q,r} · τ_{q,r} · Pr[X = x, Y = y]/Pr[R = r | Q = q]
  = Pr[X = x, Y = y]/F(q, r),

where F(q, r) is a quantity independent of (x, y). Since Pr[X = x, Y = y | Q = q, R = r] is a probability distribution, F(q, r) = Σ_{(x,y) ∈ r} Pr[X = x, Y = y] = Pr[(X, Y) ∈ r]. Thus indeed, Pr[X = x, Y = y | Q = q, R = r] = Pr[X = x, Y = y | (X, Y) ∈ r].

Finally, we argue that Pr[(X, Y) ∈ L_q] ≤ 2√α. Consider any q ∈ 𝒬, and as before, let 𝒳_q = {x₁, ..., x_M}, 𝒴_q = {y₁, ..., y_N}, sorted appropriately, and, for i ∈ [M], j ∈ [N], let r_{ij} = {x₁, ..., x_i} × {y₁, ..., y_j}. Then (x, y) ∈ L_q iff (x, y) ∈ r_{ij} for some r_{ij} ∈ 𝓡₀ (i.e., with Pr[(X, Y) ∈ r_{ij}] < α). Let i* be the maximum value in [M] such that Pr[X ∈ {x₁, ..., x_{i*}}] ≤ √α, and similarly, let j* be the maximum value in [N] such that Pr[Y ∈ {y₁, ..., y_{j*}}] ≤ √α. Then we note that, if i > i* and j > j*, then (x_i, y_j) ∉ L_q. This is because (x_i, y_j) ∈ r_{i′j′} ⟹ (i′ ≥ i > i*, j′ ≥ j > j*) ⟹ r_{i′j′} ∉ 𝓡₀, as Pr[(X, Y) ∈ r_{i′j′}] = Pr[X ∈ {x₁, ..., x_{i′}}] · Pr[Y ∈ {y₁, ..., y_{j′}}] > √α · √α = α (by the definition of i* and j*). Hence,

Pr[(X, Y) ∈ L_q] ≤ Pr[X ∈ {x₁, ..., x_{i*}} ∨ Y ∈ {y₁, ..., y_{j*}}] ≤ Pr[X ∈ {x₁, ..., x_{i*}}] + Pr[Y ∈ {y₁, ..., y_{j*}}] ≤ 2√α.
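The prefix-rectangle construction above can also be checked mechanically on a small instance. In the sketch below, the joint distribution (Q = X with (X, Y) uniform bits, so that X and Y are independent both marginally and given Q) is an assumed toy example; the code builds p_{R|XYQ} by the recipe above using exact rational arithmetic and verifies that the conditional masses sum to 1.

```python
from fractions import Fraction as F
from itertools import product

# Toy pmf p(x, y, q) with X and Y independent, marginally and given Q:
# here Q = X and (X, Y) are uniform bits (an assumed illustrative example).
Xs, Ys, Qs = (0, 1), (0, 1), (0, 1)
p = {(x, y, q): F(1, 4) if q == x else F(0) for x, y, q in product(Xs, Ys, Qs)}

px = {x: sum(p[(x, y, q)] for y in Ys for q in Qs) for x in Xs}
py = {y: sum(p[(x, y, q)] for x in Xs for q in Qs) for y in Ys}
pq = {q: sum(p[(x, y, q)] for x in Xs for y in Ys) for q in Qs}
pxq = {(x, q): sum(p[(x, y, q)] for y in Ys) for x in Xs for q in Qs}
pyq = {(y, q): sum(p[(x, y, q)] for x in Xs) for y in Ys for q in Qs}

def rect_dist(q):
    """Pr[R = r | X = x, Y = y, Q = q] for the prefix rectangles r_ij."""
    # Sort x's (and y's) by decreasing ratio Pr[X=x,Q=q]/(Pr[X=x]Pr[Q=q]).
    rx = sorted(Xs, key=lambda x: -pxq[(x, q)] / (px[x] * pq[q]))
    ry = sorted(Ys, key=lambda y: -pyq[(y, q)] / (py[y] * pq[q]))
    ratx = lambda i: pxq[(rx[i], q)] / (px[rx[i]] * pq[q]) if i < len(rx) else F(0)
    raty = lambda j: pyq[(ry[j], q)] / (py[ry[j]] * pq[q]) if j < len(ry) else F(0)
    out = {}  # {(rectangle, (x, y)): Pr[R = rectangle | x, y, q]}
    for i in range(1, len(rx) + 1):
        for j in range(1, len(ry) + 1):
            # sigma, tau telescope: ratio at the prefix end minus the next ratio.
            sigma, tau = ratx(i - 1) - ratx(i), raty(j - 1) - raty(j)
            if sigma <= 0 or tau <= 0:  # mass only on strictly decreasing prefixes
                continue
            rect = (tuple(rx[:i]), tuple(ry[:j]))
            for x, y in product(*rect):
                # p(x, y | q) factorizes since X and Y are independent given Q.
                pxy_given_q = pxq[(x, q)] * pyq[(y, q)] / pq[q] ** 2
                out[(rect, (x, y))] = sigma * tau * px[x] * py[y] / pxy_given_q
    return out

# Validity check: for every (x, y, q) in the support, the rectangle
# probabilities must sum to exactly 1 (Fraction arithmetic keeps this exact).
for q in Qs:
    d = rect_dist(q)
    for x, y in product(Xs, Ys):
        if p[(x, y, q)] > 0:
            assert sum(pr for (rect, xy), pr in d.items() if xy == (x, y)) == 1
```

In this toy instance, under Q = q the only rectangle receiving mass is {q} × 𝒴, chosen with probability 1, which matches the intended behaviour of the construction.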