Inference under Information Constraints II: Communication Constraints and Shared Randomness
Jayadev Acharya∗    Clément L. Canonne†    Himanshu Tyagi‡

Abstract

A central server needs to perform statistical inference based on samples that are distributed over multiple users, who can each send a message of limited length to the center. We study problems of distribution learning and identity testing in this distributed inference setting and examine the role of shared randomness as a resource. We propose a general-purpose simulate-and-infer strategy that uses only private-coin communication protocols and is sample-optimal for distribution learning. This general strategy turns out to be sample-optimal even for distribution testing among private-coin protocols. Interestingly, we propose a public-coin protocol that outperforms simulate-and-infer for distribution testing and is, in fact, sample-optimal. Underlying our public-coin protocol is a random hash that, when applied to the samples, minimally contracts the chi-squared distance of their distribution to the uniform distribution.

∗ Cornell University. Email: [email protected]. Supported in part by NSF-CCF-1846300 (CAREER).
† Stanford University. Email: [email protected]. Supported by a Motwani Postdoctoral Fellowship.
‡ Indian Institute of Science. Email: [email protected]. Supported in part by the Bosch Research and Technology Centre, Bangalore, India, under Project E-Sense.

An abridged version of this work will appear in the 2019 International Conference on Machine Learning (ICML).
DRAFT
CONTENTS

I    Introduction
II   Notation and preliminaries
III  The setup: Communication, simulation, and inference protocols
IV   Distributed simulation
     IV-A  Impossibility of perfect simulation when ℓ < log k
     IV-B  An α-simulation protocol using rejection sampling
V    Simulate-and-Infer
     V-A  ℓ-bit distributed inference via distributed simulation
     V-B  Application: private-coin protocols from distributed simulation
     V-C  Optimality of our distributed simulation protocol
VI   Public-coin identity testing
Appendix
     A  Impossibility of perfect simulation in the interior of the probability simplex
     B  Proof of Theorem VI.2
     C  A randomness-efficient variant of Theorem VI.1
     D  From uniformity to parameterized identity testing
References

I. INTRODUCTION
Sample-optimal statistical inference has taken center stage in modern data analytics, where the sample size can be comparable to the dimensionality of the data. In many emerging applications, especially those arising in sensor networks and the Internet of Things (IoT), we are not only constrained in the number of samples but also have access to only limited communication about the samples. We consider such a distributed inference setting and seek sample-optimal algorithms for inference under communication constraints.

In our setting, there are n players, each of which gets an independent draw from an unknown k-ary distribution and can send only ℓ bits about their observed sample to a central referee using a simultaneous message passing (SMP) protocol for communication. The referee uses the communication from the players to accomplish an inference task P. What is the minimum number of players n required by an SMP protocol that successfully accomplishes P, as a function of k, ℓ, and the relevant parameters of P?

Our first contribution is a general simulate-and-infer strategy for inference under communication constraints, where we use the communication to simulate samples from the unknown distribution at the referee. To describe this strategy, we introduce a natural notion of distributed simulation: n players, each observing an independent sample from an unknown k-ary distribution p, can send ℓ bits each to a referee. A distributed simulation protocol consists of an SMP protocol and a randomized decision map that enables the referee to generate a sample from p using the communication from the players. Clearly, when ℓ ≥ log k, such a sample can be obtained by getting the sample of any one player. But what can be done in the communication-starved regime of ℓ < log k?

We first show that perfect simulation is impossible using any finite number of players in the communication-starved regime. But perfect simulation is not even required for our application.
When we allow a small probability of declaring failure, namely admit Las Vegas simulation schemes, we obtain a distributed simulation scheme that requires an optimal O(k/2^ℓ) players to simulate k-ary distributions using ℓ bits of communication per player. Thus, our proposed simulate-and-infer strategy can accomplish P with a factor O(k/2^ℓ) blow-up in sample complexity.

The specific inference tasks we focus on are those of distribution learning, where we seek to estimate the unknown k-ary distribution to an accuracy of ε in total variation distance, and identity testing, where we seek to know if the unknown distribution is a pre-specified reference distribution q or at total variation distance at least ε from it. (We assume throughout that log is in base 2 and, for ease of discussion, assume in this introduction that log k is an integer.) For distribution learning, the simulate-and-infer strategy matches the lower bound from [32] and is therefore sample-optimal. For identity testing, the plot thickens.

Recently, a lower bound for the sample complexity of identity testing using only private-coin protocols was established [3]. The simulate-and-infer protocol is indeed a private-coin protocol, and we show that it achieves this lower bound. When public coins (shared randomness) are available, [3] derived a different, more relaxed lower bound. The performance of simulate-and-infer is far from this lower bound. Our second contribution is a public-coin protocol for identity testing that not only outperforms simulate-and-infer but matches the lower bound in [3] and is sample-optimal.

We provide a concrete description of our results in the next section, followed by an overview of our proof techniques in the subsequent section. To put our results in context, we provide a brief overview of the literature as well.

A. Main results
We begin by summarizing our distributed simulation results.

Theorem I.1. For every k, ℓ ≥ 1, there exists a private-coin protocol with ℓ bits of communication per player for distributed simulation over [k] with expected number of players O((k/2^ℓ) ∨ 1). Moreover, this expected number is optimal, up to constant factors, even when public-coin and interactive communication protocols are allowed.

The proposed protocol only provides a relaxed guarantee, as its number of players is only bounded in expectation. In fact, we can show that distributed simulation is impossible unless we allow for such a relaxation.
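The Las Vegas relaxation behind this guarantee can be illustrated with a deliberately simplified sketch. The following Python snippet is ours, not the paper's protocol: it splits {0, ..., k-1} into B ≈ k/2^ℓ blocks of 2^ℓ − 1 symbols, lets player j name its sample exactly when the sample falls in block j (and send an abort symbol otherwise), and has the referee output the report of a uniformly random block. Since Pr[output = x] = p(x)/B for every x, the output conditioned on not aborting is distributed exactly as p; the price is a success probability of only 1/B per round, so this naive sketch uses Θ((k/2^ℓ)^2) players in expectation, rather than the optimal O(k/2^ℓ) achieved by the protocol of Section IV-B.

```python
import random

def simulate_round(samples, k, ell, referee_rng):
    """One round of a simplified Las Vegas simulation sketch (ours, not the
    paper's optimal protocol).  The alphabet {0, ..., k-1} is split into B
    contiguous blocks of at most 2**ell - 1 symbols, one block per player.
    With ell bits, player j can either name its sample exactly (when the
    sample lies in block j) or send an abort symbol, modeled as None."""
    block = 2 ** ell - 1                     # symbols a player can identify
    B = -(-k // block)                       # number of blocks = players per round
    msgs = []
    for j in range(B):
        lo, hi = j * block, min((j + 1) * block, k)
        x = samples[j]                       # player j's private sample
        msgs.append(x if lo <= x < hi else None)
    # Referee's decision map: output the report of a uniformly random block.
    # Pr[output = x] = p(x) / B, so conditioned on output != None the
    # output has exactly the distribution p of the samples.
    return msgs[referee_rng.randrange(B)]
```

Repeating rounds (with fresh players) until the output is not None gives a Las Vegas scheme; the per-round abort probability is 1 − 1/B.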
Theorem I.2. For k ≥ 2, ℓ < ⌈log k⌉, and any N ∈ ℕ, there exists no SMP protocol with N players and ℓ bits of communication per player for perfect distributed simulation over [k]. Furthermore, the result continues to hold even for public-coin and interactive communication protocols.

The proof is given in Section IV-A.

(Footnote: For simplicity of exposition, we describe these results in terms of Las Vegas algorithms, which produce a sample from the unknown distribution when they terminate, yet may never terminate. Equivalently, one may enforce a strict number of players but allow the protocol to abort with a special symbol with small constant probability, which is how our results will be stated in Section IV-B.)
Since the distributed simulation protocol in Theorem I.1 is a private-coin protocol, we can use it to generate the desired number of samples from the unknown distribution at the center to obtain the following result.
Theorem I.3 (Informal). For any inference task P over k-ary distributions with sample complexity s in the non-distributed model, there exists a private-coin protocol for P using ℓ bits of communication per player and requiring n = O(s · ((k/2^ℓ) ∨ 1)) players.

Instantiating this general statement for distribution learning and identity testing leads to the following results.
Corollary I.4. For every k, ℓ ≥ 1, simulate-and-infer can accomplish distribution learning over [k] with ℓ bits of communication per player and n = O(k^2 / ((2^ℓ ∧ k) ε^2)) players.

Corollary I.5.
For every k, ℓ ≥ 1, simulate-and-infer can accomplish identity testing over [k] using ℓ bits of communication per player and n = O(k^{3/2} / ((2^ℓ ∧ k) ε^2)) players.

By the lower bound for the sample complexity of distribution learning in [32] (see also [3]), we note that simulate-and-infer is sample-optimal for distribution learning even when public-coin protocols are allowed. In fact, the sample complexity of simulate-and-infer for identity testing matches the lower bound for private-coin protocols in [3], rendering it sample-optimal. Perhaps our most striking result is the next one, which shows that public-coin protocols can outperform the sample complexity of private-coin protocols for identity testing by a factor of √(k/2^ℓ).

Theorem I.6.
For every k, ℓ ≥ 1, there exists a public-coin protocol for identity testing over [k] using ℓ bits of communication per player and n = O(k / ((2^{ℓ/2} ∧ √k) ε^2)) players.

We further note that our protocol is quite simple to describe and implement: we generate a random partition of [k] into 2^ℓ equisized parts and report which part each sample lies in. Although, as stated, our protocol seems to require Ω(ℓ · k) bits of shared randomness, inspection of the proof shows that limited-independence shared randomness suffices, drastically reducing the number of random bits required; see Remark VI.7 for a discussion.

Our results are summarized in the table below.
TABLE I
SUMMARY OF THE SAMPLE COMPLEXITY OF DISTRIBUTED LEARNING AND TESTING, UNDER PRIVATE AND PUBLIC RANDOMNESS. ALL RESULTS ARE ORDER-OPTIMAL.

  Distribution learning, public- or private-coin:  (k/ε^2) · (k/2^ℓ)
  Identity testing, public-coin:                   (√k/ε^2) · √(k/2^ℓ)
  Identity testing, private-coin:                  (√k/ε^2) · (k/2^ℓ)

B. Proof techniques
We now provide a high-level description of the proofs of our main results.

a) Distributed simulation: The upper bound of Theorem I.1 uses a rejection-sampling-based approach; see Section IV-B for details. The lower bound follows by relating distributed simulation to communication-constrained distribution learning and using the lower bound for the sample complexity of the latter from [32], [3].

b) Distributed identity testing:
For ease of exposition, we hereafter focus on uniformity testing, as it contains most of the ideas. To test whether an unknown distribution p is uniform using at most ℓ bits to describe each sample, a natural idea is to randomly partition the alphabet into L := 2^ℓ parts and send to the referee independent samples from the L-ary distribution p′ induced by p on this partition. For a random balanced partition (i.e., where every part has cardinality k/L), clearly the uniform distribution u_k is mapped to the uniform distribution u_L. Thus, one can hope to reduce the problem of testing uniformity of p (over [k]) to that of testing uniformity of p′ (over [L]). The latter task would be easy to perform, as every player can simulate one sample from p′ and communicate it fully to the referee with log L = ℓ bits of communication. Hence, the key issue is to argue that this random "flattening" of p somehow preserves the distance to uniformity; namely, that if p is ε-far from u_k, then (with constant probability over the choice of the random partition) p′ will remain ε′-far from u_L, for some ε′ depending on ε, L, and k. If true, this would imply a very simple protocol with O(√L/ε′^2) players, where all players agree on a random partition and send the induced samples to the referee, who then runs a centralized uniformity test. Therefore, in order to apply the aforementioned natural recipe, it suffices to derive a "random flattening" structural result for ε′ ≍ √(L/k) · ε.

An issue with this approach, unfortunately, is that the total variation distance (that is, the ℓ_1 distance) does not behave as desired under these random flattenings, and the validity of our desired result remains unclear. Interestingly, an analogous statement with respect to the ℓ_2 distance turns out to be much more manageable and suffices for our purposes.
Specifically, we show that a random flattening of p does preserve, with constant probability, the ℓ_2 distance to uniformity. In our case, by the Cauchy–Schwarz inequality, the original ℓ_2 distance will be at least γ ≍ ε/√k, which implies, using known ℓ_2 testing results, that one can test uniformity of the "randomly flattened" p′ with O(1/(√L γ^2)) = O(k/(2^{ℓ/2} ε^2)) samples. This yields the desired guarantees on the protocol.
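The random flattening at the heart of this argument is easy to sketch in code. The following Python snippet (the function names and interface are ours, not from the paper) builds a balanced random partition of the alphabet from a shared seed, computes the induced L-ary distribution p′, and evaluates its ℓ_2 distance to uniform; as noted above, a balanced partition maps u_k exactly to u_L.

```python
import random

def balanced_partition(k, ell, seed):
    """Public coins: a shared seed determines a uniformly random balanced
    partition of {0, ..., k-1} into L = 2**ell parts of k // L symbols each.
    (Assumes L divides k for simplicity.)"""
    L = 2 ** ell
    assert k % L == 0
    rng = random.Random(seed)                # same seed at every player
    symbols = list(range(k))
    rng.shuffle(symbols)
    part_of = [0] * k
    for idx, x in enumerate(symbols):
        part_of[x] = idx % L                 # each part gets exactly k // L symbols
    return part_of

def induced(p, part_of, L):
    """The 'flattened' L-ary distribution p': p'(j) is the mass of part j."""
    q = [0.0] * L
    for x, px in enumerate(p):
        q[part_of[x]] += px
    return q

def l2_to_uniform(q):
    """The ell_2 distance ||q - u_L||_2."""
    L = len(q)
    return sum((qj - 1.0 / L) ** 2 for qj in q) ** 0.5
```

Each player then needs only ℓ bits to report `part_of[x]` for its own sample x, which is exactly the communication pattern of the public-coin protocol.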
The distribution learning problem is a finite-dimensional parametric learning problem, and the identity testing problem is a specific goodness-of-fit problem. Both of these problems have a long history in statistics. However, the sample-optimal setting of interest to us has received a lot of attention in the past decade, especially in the computer science literature; see [40], [16], [8] for surveys. Most pertinent to our work is uniformity testing [28], [39], [20], the prototypical distribution testing problem, for which the sample complexity was established to be Θ(√k/ε^2) in [39], [43]; as well as identity testing, shown to have order-wise similar sample complexity [10], [4], [43], [22], [27].

Distributed hypothesis testing and estimation problems were first studied in information theory, although in a different setting than the one we consider [6], [29], [30]. The focus in that line of work has been to characterize the trade-off between the asymptotic error exponent and the communication rate per sample.

Closer to our work is distributed parameter estimation and functional estimation, which has gained significant attention in recent years (see, e.g., [23], [25], [14], [45]). In these works, much like in our setting, independent samples are distributed across players, which deviates from the information theory setting described above, where each player observes a fixed dimension of each independent sample. However, the communication model in these results differs from ours, and the communication-starved regime we consider has not been studied in these works.

The problem of distributed density estimation, too, has gathered recent interest in various statistical settings [13], [9], [48], [41], [21], [32], [47], [5]. Among these, our work is closest to the results in [32], [31], and [21]. In particular, [21] considers both ℓ_1 (total variation) and ℓ_2 losses, although in a different setting than ours.
They study an interactive model where the players do not have any individual communication constraint; instead, the goal is to bound the total number of bits communicated over the course of the protocol. This difference in the model leads to incomparable results and techniques (for instance, the lower bound for learning k-ary distributions in our model is higher than the upper bound in theirs).

Our current work further deviates from this prior literature, since we consider distribution testing as well and examine the role of public coins for SMP protocols. Additionally, a central theme here is the connection to distributed simulation and its limitations in enabling distributed testing. In contrast, the prior work on distribution estimation, in essence, establishes the optimality of simple protocols that rely on distributed simulation for inference. We note that although the recent work of [12] considers both communication complexity and distribution testing, their goal and results are very different: indeed, they explain how to leverage negative results in the standard SMP model of communication complexity to obtain sample complexity lower bounds for collocated distribution testing.

Problems related to the joint simulation of probability distributions have been the object of focus in the information theory and computer science literature. Starting with the works of Gács and Körner [24] and Wyner [46], where the problems of generating shared randomness from correlated randomness and vice versa, respectively, were considered, several important variants have been studied, such as correlated sampling [15], [37], [33], [11] and non-interactive simulation [35], [26], [19]. Yet, to the best of our knowledge, our problem of exact simulation of a single (unknown) distribution from multiple parties under communication constraints has not been studied previously.
D. Relation to chi-square contraction lower bounds
This work is the second in a series of papers, the first of which ([3]) presented a general technique for establishing lower bounds for inference under information constraints. When information constraints are imposed, statistical distances shrink due to the data processing inequality. At a high level, the lower bound in [3] was based on quantifying the contraction in chi-square distance in a neighborhood of the uniform distribution due to information constraints. Note that, in view of the reduction in Appendix D, the neighborhood of any distribution is roughly isometric to the neighborhood of the uniform distribution (though the isometry can depend on the reference distribution). Thus, our lower bound aptly captures the bottleneck imposed by information constraints for a broad class of inference problems, and not just uniformity testing.

The current article, and our upcoming article [1] (see the preprint for a preliminary version), seek to find schemes that match the lower bounds established in [3]. An interesting feature of our lower bounds is that they quantitatively differentiate the chi-square contraction caused by private- and public-coin protocols. Our schemes in this paper draw on the principles established by our lower bounds and use a minimally contracting hash for inference under information constraints. Specifically, our private-coin simulate-and-infer scheme and our public-coin scheme are based on identifying a private-coin and a public-coin communication protocol, respectively, that minimally contract the chi-square distances in the neighborhood of the uniform distribution. We term this principle of designing inference schemes under information constraints the minimally contracting
hashing (MCH) principle. At this point, it is just a heuristic: we seek mappings that attain the minmax and maxmin chi-square contractions that appear in our lower bounds in [3], and propose them as good candidates for selecting channels for inference under information constraints in our setting. We believe, however, that a formal version of the MCH principle can be established and applied gainfully in this setting.

The MCH principle seems to remain valid even for local privacy constraints, as considered in [1]. Moreover, beyond the papers in this series, our preliminary calculations suggest that our treatment and the MCH principle extend to testing problems concerning high-dimensional distributions as well. Finally, while in this paper we have quantified the reduction in sample complexity due to the availability of public randomness for a fixed amount of communication per sample, quantifying the complete sample-randomness tradeoff for distributed identity testing under communication constraints is work in progress.
E. Organization
We begin by formally introducing our distributed model in Section III. Next, Section IV introduces the question of distributed simulation and contains our protocols and impossibility results for this problem. In Section V, we consider the relation between distributed simulation and private-coin distribution inference. The subsequent section, Section VI, focuses on the problem of identity testing and contains the proof of Theorem I.6.

II. NOTATION AND PRELIMINARIES
Throughout this paper, we denote by log the logarithm to the base 2. We use standard asymptotic notation O(·), Ω(·), and Θ(·) for complexity orders, and will sometimes write a_n ≲ b_n to indicate that there exists an absolute constant c > 0 such that a_n ≤ c · b_n for all n. Finally, we denote by a ∧ b and a ∨ b the minimum and the maximum of two numbers a and b, respectively.

Let [k] be the set of integers {1, 2, . . . , k}. Given a fixed (and known) discrete domain X of cardinality |X| = k, we write ∆_k for the set of probability distributions over X, i.e., ∆_k = { p : [k] → [0, 1] : Σ_{x∈[k]} p(x) = 1 }. For a discrete set X, we denote by u_X the uniform distribution on X, and will omit the subscript when the domain is clear from context.

The total variation distance between two probability distributions p, q ∈ ∆_k is defined as

    d_TV(p, q) := sup_{S⊆X} (p(S) − q(S)) = (1/2) Σ_{x∈X} |p(x) − q(x)|,

namely, d_TV(p, q) is equal to half of the ℓ_1 distance between p and q. In addition to the total variation distance, we will extensively use the ℓ_2 distance between distributions p, q ∈ ∆_k, denoted ‖p − q‖_2.

III. THE SETUP: COMMUNICATION, SIMULATION, AND INFERENCE PROTOCOLS
A. Communication protocols
We restrict ourselves to simultaneous message passing (SMP) protocols of communication, wherein the messages from all players are transmitted simultaneously to the central server, and no other communication is allowed. We allow randomized SMP protocols and distinguish between two forms of randomness: private-coin protocols, where each player can only use their own independent private randomness, which is not available to the referee, and public-coin protocols, where the players and the referee have access to shared randomness. SMP rules out any interaction between the players other than the agreement on the protocol and coordination using shared randomness for public-coin SMP protocols. In particular, this setting precludes interactive communication models. Nonetheless, this setting is natural for a variety of use-cases where players represent users connected to a central server or sensors connected to a fusion center. It can even be used for the case where each sample is seen by the same machine, but at different times, and the machine does not maintain any memory to store the previous samples. For instance, this machine can be an analog-to-digital converter that quantizes each input to ℓ bits.

Definition III.1 (Private-coin SMP Protocols). Let U_1, . . . , U_n denote independent random variables, which are also jointly independent of (X_1, . . . , X_n), and represent the private randomness of the players. An ℓ-bit private-coin SMP protocol π consists of the following two steps: (a) player i selects their channel W_i ∈ W_ℓ as a function of U_i, and (b) sends their message M_i ∈ {0, 1}^ℓ, obtained by passing X_i through W_i, to the referee. The referee receives the messages M = (M_1, M_2, . . . , M_n), but does not have access to the private randomness (U_1, . . . , U_n) of the players.

We assume that the protocol is decided ahead of time; namely, the distribution of the U_i's is known to the referee, but not its instantiation.
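As a minimal illustration of this definition (the function names and interface are ours, not from the paper), the following sketch models a channel W_i as a function from a sample to an ℓ-bit message, selected using the player's private randomness U_i:

```python
import random

def run_private_coin_smp(samples, choose_channel, ell, seed=0):
    """One execution of an ell-bit private-coin SMP protocol in the spirit of
    Definition III.1.  Player i draws private randomness U_i, selects a
    channel W_i = choose_channel(U_i), and sends the message M_i = W_i(X_i).
    The referee sees only the list of messages."""
    rng = random.Random(seed)
    messages = []
    for x in samples:
        u_i = rng.random()                 # private randomness U_i
        w_i = choose_channel(u_i)          # channel W_i : sample -> message
        m_i = w_i(x)
        assert 0 <= m_i < 2 ** ell         # the message fits in ell bits
        messages.append(m_i)
    return messages
```

For example, with a deterministic channel that ignores U_i and sends the sample modulo 2^ℓ, `run_private_coin_smp([0, 5, 9], lambda u: (lambda x: x % 4), 2)` returns `[0, 1, 1]`.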
Note that in a private-coin SMP communication protocol, the communication M_i from player i is a randomized function of (X_i, U_i). Moreover, since both (X_1, . . . , X_n) and (U_1, . . . , U_n) are generated from a product distribution, so is (M_1, . . . , M_n).

Definition III.2 (Public-coin SMP Protocols). Let U be a random variable independent of (X_1, . . . , X_n), available to all players and the referee. An ℓ-bit public-coin SMP protocol π consists of the following two steps: (a) the players select their channels W_1, . . . , W_n ∈ W_ℓ as a function of U, and (b) send their messages M_1, . . . , M_n ∈ {0, 1}^ℓ, obtained by passing X_i through W_i, to the referee. The referee receives the messages M = (M_1, . . . , M_n) and is given access to U as well.

In contrast to private-coin protocols, in a public-coin SMP communication protocol the communication M_i from player i is a (randomized) function of (X_i, U), and therefore the M_i's are not independent. They are, however, independent conditioned on the shared randomness U.

We denote the communication protocols that are used at the players to generate the messages by π. For public-coin protocols, to make explicit the role of the randomness in the choice of the channels, we sometimes write π(x^n, u) to denote the output of the protocol (the messages) when the input of the players is x^n = (x_1, . . . , x_n) and the public-coin realization is U = u. Also, we write π_i(x^n, u) for the message sent by player i using protocol π. See Fig. 1 for a depiction of the communication setting.

[Fig. 1. The communication-constrained distributed model, where each M_i ∈ {0, 1}^ℓ. In the private-coin setting the messages M_1, . . . , M_n are independent, while in the public-coin setting they are jointly randomized.]

B. Distributed simulation protocols
The distributed simulation problem we propose is rather natural, yet, to the best of our knowledge, it has not been studied in the prior literature. In this section, we define the simulation problem, and in the next section we exhibit its use as a natural tool to solve any communication-limited inference problem. Recall that our goal is to enable the referee to generate samples from the unknown distribution using communication from the players. Note that the players only know the alphabet [k] from which the samples are generated, but have no other knowledge of the distribution. We allow the players to use an SMP protocol, private-coin or public-coin, to facilitate the simulation of samples by the referee.

We now state the question of simulation formally. An ℓ-bit simulation protocol S = (π, T) of k-ary distributions using n players consists of an ℓ-bit SMP protocol π and a decision mapping T. The output of π is an element of M^n, where M = {0, 1}^ℓ. The decision mapping T : M^n → X ∪ {⊥} is a randomized function that takes as input the messages from the players and outputs an element of X ∪ {⊥}. Upon receiving messages m^n = (m_1, . . . , m_n) ∈ M^n, the referee outputs an x ∈ X with probability T(x | m^n) and the symbol ⊥ with probability T(⊥ | m^n) = 1 − Σ_{x∈X} T(x | m^n). The protocol is private-coin if π is a private-coin communication protocol, and it is public-coin if π is public-coin. For public-coin protocols, the decision mapping T = T_U can be chosen as a function of U, the public randomness.

We want the distribution of the random output of the decision mapping to coincide with the unknown underlying distribution p. This objective is made precise next.

Definition III.3 (α-Simulation). A protocol S = (π, T) is an α-simulation protocol if, for every p ∈ ∆_k that generates the input samples X_1, . . . , X_n for the SMP protocol π, the output X̂ = T(π(X_1, . . . , X_n)) ∈ X ∪ {⊥} of the simulation protocol satisfies

    Pr_{X^n ∼ p^n}[ X̂ = x | X̂ ≠ ⊥ ] = p(x),  for all x ∈ X,

and the probability of abort satisfies

    Pr_{X^n ∼ p^n}[ X̂ = ⊥ ] ≤ α.

A 0-simulation, namely a simulation with probability of abort zero, is termed a perfect simulation.

C. Distributed inference protocols
We give a general, decision-theoretic description of distributed inference protocols that is applicable beyond the use-cases considered in this work. For the most part, we restrict ourselves to learning and identity testing of discrete distributions, but our results for distributed inference are valid in general settings.

We start with a description of inference tasks. An inference problem P is a tuple (C, X, E, l), where C is a collection of distributions over X, E is a class of allowed actions or decisions that can be taken upon observing samples generated from p ∈ C, and l : C × E → R_+^q is a loss function used to evaluate the performance. A (randomized) decision rule is a map e : X^n → E, and for samples X^n generated from p ∈ C, the loss of the decision rule is measured by the vector l(p, e(X^n)) in R_+^q. Our benchmark for performance will be the expected loss vector

    L(p, e) := E_{X^n ∼ p}[ l(p, e(X^n)) ].    (1)
Note that the expected loss vector, too, is a q-dimensional vector.

An ℓ-bit distributed inference protocol I = (π, e) for the inference problem (C, X, E, l) consists of an ℓ-bit SMP protocol π and an estimator e available to the referee who, upon observing the messages M = (M_1, . . . , M_n), follows a (randomized) decision rule e : M^n → E. For private-coin inference protocols, π is a private-coin SMP protocol, and for public-coin inference protocols, both the communication protocol π and the decision rule e are allowed to depend on the public randomness U, available to everyone.

We now state a measure of performance of inference protocols.

Definition III.4 (~γ-Inference protocol). For ~γ ∈ R_+^q, a protocol (π, e) is a ~γ-inference protocol if, for every p ∈ C,

    L_i(p, e) ≤ γ_i,  for all 1 ≤ i ≤ q,

where L_i(p, e) denotes the i-th coordinate of L(p, e).

We instantiate the abstract definitions above with two illustrative examples that we study in this paper.

a) Distribution Learning: In the (k, ε)-distribution learning problem, we seek to estimate a distribution p in ∆_k to within ε in total variation distance. Formally, a (randomized) mapping e : X^n → ∆_k constitutes an (n, ε, δ)-estimator for ∆_k if the estimate p̂ = e(X^n) satisfies

    sup_{p ∈ ∆_k} Pr_{X^n ∼ p}[ d_TV(p̂, p) > ε ] < δ,

where d_TV(p, q) denotes the total variation distance between p and q. Namely, p̂ estimates the input distribution p to within distance ε with probability at least 1 − δ.

The sample complexity of (k, ε, δ)-distribution learning is the minimum n such that there exists an (n, ε, δ)-estimator for ∆_k. It is well known that the sample complexity of distribution learning is Θ(k/ε^2) and that the empirical distribution attains it.

This problem can be cast in our general framework by setting X = [k], C = E = ∆_k, q = 1, and taking the loss to be l(p, p̂) := 1{d_TV(p, p̂) > ε}.
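The centralized benchmark mentioned above, the empirical distribution, is straightforward to compute; the following is a minimal sketch (the function name is ours):

```python
from collections import Counter

def empirical_distribution(samples, k):
    """The empirical estimator: hat_p(x) = (# samples equal to x) / n.
    With n = Theta(k / eps**2) i.i.d. samples over [k], hat_p is eps-close to
    p in total variation distance with constant probability, which is
    sample-optimal in the centralized setting."""
    n = len(samples)
    counts = Counter(samples)
    return [counts[x] / n for x in range(k)]
```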
For this setting of distribution learning, we term the δ-inference protocol an ℓ-bit (k, ε, δ)-learning protocol for n players.

b) Identity Testing: Let q ∈ ∆_k be a known reference distribution. In the (k, ε, δ)-identity testing problem, we seek to use samples from an unknown p ∈ ∆_k to test if p equals q or if it is ε-far from q in
For constant δ , the sample complexity of ( k, ε ) -uniformity testing was shownto be Θ (cid:0) √ k/ε (cid:1) in [39], [44], and the exact dependence on δ was later identified in [34], [20].Uniformity testing, too, can be obtained as a special case of our general formulation by setting X = [ k ] , C = { u k } ∪ { p ∈ ∆ k : d TV ( p , u k ) > ε } , E = { , } , and the -dimensional loss function l : C × E → R to be l ( p , b ) = b · { p = u k } ,l ( p , b ) = (1 − b ) · { p = u k } , for b ∈ { , } . For simplicity, we consider the error parameter ~γ = ( δ, δ ) . For this case, we termthe δ -inference protocol an ℓ -bit ( k, ε, δ ) - uniformity testing protocol for n players. We provide ( k, ε, δ ) -uniformity testing protocols for arbitrary δ , but we establish lower bounds only for δ = 1 / . This choiceof probability of error is to remain consistent with [3], since we borrow the general lower bounds fromthere. For simplicity we will refer to ( k, ε, / -uniformity testing protocols simply as ( k, ε ) - uniformitytesting protocols . The sample complexity for a fixed q has been studied under the “instance-optimal” setting (see [43], [12]): while the questionis not fully resolved, nearly-tight upper and lower bounds are known. We observe that by this formulation allows, more generally, to study the dependence of sample complexity Type-I and Type-IIerror probabilities δ and δ by considering ~γ = ( δ , δ ) . DRAFT4
Note that distributed variants of several other inference problems, such as estimating functionals of distributions and parametric estimation problems, can be included as instantiations of the distributed inference problem described above.

IV. DISTRIBUTED SIMULATION
In this section, we consider the distributed simulation problem described in Section III-B. We start by considering the more ambitious problem of perfect simulation, where, using a finite number of players n, the referee must simulate a sample from the unknown p using the ℓ-bit messages from the players. We then consider the relaxed problem of α-simulation for a constant α ∈ (0, 1) (see Definition III.3). We prove the following results for these problems.
1) In Section IV-A, we show that for any ℓ < ⌈log k⌉ and finite n, perfect simulation is impossible using n players.
2) In Section IV-B, for a constant α ∈ (0, 1), we exhibit an ℓ-bit private-coin α-simulation protocol for k-ary distributions using O(k/2^ℓ) players.
3) Finally, in Section V-C, drawing on the lower bounds for distribution learning, we will prove the sample-optimality of our distributed simulation algorithm above, up to constant factors. In fact, even with public coins the number of players cannot be reduced by more than a constant factor.
We have defined the distributed simulation problem as one where the output distribution conditioned on not outputting ⊥ is identical to p. One may wonder about another natural relaxation of perfect simulation, where the goal is to generate a sample according to a distribution that is α-close to p (say, in total variation distance). A primary reason for considering the former is that the ability to generate exact samples from p allows us to compose the simulation with a centralized algorithm for any inference task, as we show in Section V.

A. Impossibility of perfect simulation when ℓ < log k

We show that any simulation that works for all points in the interior of the (k − 1)-dimensional probability simplex must fail for a distribution on the boundary. Our main result of this section is the following: Theorem IV.1.
For any n ≥ 1, there exists no ℓ-bit perfect simulation for k-ary distributions using n players unless ℓ ≥ ⌈log k⌉.

Proof. Suppose that for some ℓ < ⌈log k⌉ there exists an ℓ-bit (public-coin) perfect simulation S = (π, T) for k-ary distributions using n players. Fix a realization U = u of the public randomness. Since ℓ < ⌈log k⌉, by the pigeonhole principle, for each player at least two symbols in [k] map to the same message. Therefore, we can find a message vector (m₁, ..., m_n) and distinct elements x_i ≠ x′_i ∈ [k] for each i ∈ [n] such that

π_i(x_i, u) = π_i(x′_i, u) = m_i,  (2)

that is, for U = u, the SMP protocol sends the same message vector (m₁, ..., m_n) when the observation of the players is (x₁, ..., x_n) or (x′₁, ..., x′_n). For a perfect simulation, the referee is not allowed to output ⊥, and it must output a symbol in [k].

Next, consider the message vector m = (m₁, ..., m_n) and a symbol x ∈ [k] such that T_u(x | m) > 0, namely the referee outputs x with a nonzero probability when the public randomness is U = u and the message received is m. The key observation in our proof is that since x_i ≠ x′_i, in view of (2), for each i either x_i ≠ x or x′_i ≠ x. Without loss of generality, we assume that x_i ≠ x for each 1 ≤ i ≤ n.

Finally, consider a distribution p such that p_x = 0 and p_{x′} > 0 for all x′ ≠ x. For perfect simulation, under this distribution, the referee must never declare x. However, conditioned on the public-coin realization being U = u, the probability of observing the message vector (m₁, ..., m_n) above is

Pr[ M = (m₁, ..., m_n) | U = u ] = Σ_{x̃ⁿ ∈ [k]ⁿ} ∏_{i=1}^n 1{π_i(x̃_i, u) = m_i} · p_{x̃_i} ≥ ∏_{i=1}^n p_{x_i} > 0,

thus showing that the referee has a nonzero probability of outputting x, even though p_x = 0.
This contradicts the assumption that S is a perfect simulation.

Note that the proof above shows that any perfect simulation of a distribution p in the interior of the (k − 1)-dimensional probability simplex must fail for at least one distribution on the boundary of the simplex. In fact, a much stronger impossibility result holds. For the smallest non-trivial parameter values of k = 3 and ℓ = 1, no perfect simulation protocol exists that simulates all distributions in any open neighborhood in the interior of the probability simplex. Theorem IV.2.
For any n ≥ 1, there does not exist any ℓ-bit perfect simulation of ternary distributions (k = 3) unless ℓ ≥ 2, even when the input distribution is known to come from an open set in the interior of the probability simplex.

We defer the proof of this theorem to Appendix A. Roughly speaking, the argument proceeds by establishing that we can, without loss of generality, restrict to deterministic protocols. We then show that any deterministic simulation protocol must output ⊥ with a nonzero probability – contradicting the assumption of perfect simulation. Together, the two incomparable impossibility results of Theorems IV.1 and IV.2 (one for general 1 ≤ ℓ < ⌈log k⌉, but at the boundary of the probability simplex; the other for ℓ = 1 and k ≥ 3, but in the interior) rule out perfect simulation in a strong sense in the case of SMP protocols.

We close this section by extending our impossibility result beyond SMP protocols, to the setting where the players are allowed to communicate interactively. In an interactive communication protocol, players 1 to n communicate sequentially in rounds, with player i communicating in round i. The communication is in a broadcast mode where, along with the referee, the players too receive the communication from each other. The communication of player i can depend on their local observation and the communication received in the previous i − 1 rounds from the other players.

Our next result shows that perfect simulation is impossible even when players use an interactive communication protocol. The proof uses a standard method for simulating sequential protocols with SMP protocols, by increasing the number of players (see, for instance, the reduction of round complexity in [38]). Lemma IV.3.
For every n ≥ 1, if there exists an interactive public-coin ℓ-bit perfect simulation of k-ary distributions with n players, then there exists a public-coin ℓ-bit perfect simulation of k-ary distributions with 2^{ℓ(n−1)+1} players that uses only SMP.

Proof. Consider an interactive communication protocol π for distributed simulation with n players and ℓ bits of communication per player. We can view the overall protocol as a 2^ℓ-ary tree of depth n where each node is assigned to a player. An execution of the protocol is a path from the root to a leaf of the tree; namely, along any such path each player appears once. This protocol can be simulated non-interactively using at most (2^{ℓn} − 1)/(2^ℓ − 1) < 2^{ℓ(n−1)+1} players, with a separate player assigned to each node, where the players assigned to nodes at depth j send the messages corresponding to those nodes. Then, the referee, upon receiving all the messages, can output the index of the leaf node by following the path from the root to the leaf. Corollary IV.4.
Theorems IV.1 and IV.2 hold even when the players are allowed to use interactive communication protocols for simulation. (Recall that public-coin protocols do allow the players to coordinate using shared randomness; but they do not interact in any other way.)

B. An α-simulation protocol using rejection sampling

In this section, we present our construction of a simulation protocol for k-ary distributions using n = O(k/2^ℓ) players, establishing the following theorem:
Theorem IV.5.
For every α ∈ (0, 1) and ℓ ≥ 1, there exists an ℓ-bit α-simulation of k-ary distributions using 10⌈log(1/α)⌉ · ⌈k/2^{ℓ−3}⌉ players. Moreover, the protocol is deterministic for the players, and only requires private randomness at the referee.

At a high level, our algorithm divides the players into batches and constructs a 3/4-simulation using each batch, that is, a simulation that aborts with probability at most 3/4. The overall simulation declares the output symbol of the first batch that does not declare an abort. By using O(⌈log(1/α)⌉) batches, we can reduce the probability of abort from 3/4 to α.

To simplify the presentation, we first present the protocol for ℓ = 1 and analyze its performance. Even for this case, we build our protocol in steps, starting with the basic version given in Algorithm 1 below, which requires n = 2k players. The next result characterizes the performance of this simulation protocol. Algorithm 1
Distributed simulation protocol using ℓ = 1: The basic version
Require: n = 2k players observing one independent sample each from an unknown p
For 1 ≤ i ≤ k, players 2i − 1 and 2i send one bit to indicate whether their observation is i.
The referee receives these n = 2k bits M₁, ..., M_n.
if exactly one of the bits M₁, M₃, ..., M_{2k−1} is equal to one, say the bit M_{2i−1}, and the corresponding bit M_{2i} is zero, then the referee outputs X̂ = i;
else the referee outputs ⊥ (abort).
end if

Theorem IV.6. The protocol in Algorithm 1 uses 2k players and is a 3/4-simulation for p ∈ ∆_k such that ‖p‖∞ ≤ 1/2.

Proof. From the description of the protocol, it is easy to verify that the output X̂ of the protocol takes the value i with probability

Pr[ X̂ = i ] = p_i · ∏_{j≠i} (1 − p_j) · (1 − p_i) = p_i · ∏_{j=1}^k (1 − p_j),  (3)

where the first term in the product corresponds to M_{2i−1} being 1, the second term to all the other messages from odd-numbered players being 0, and the final term to M_{2i} being 0. Note that this probability is proportional to p_i, showing that conditioned on the event {X̂ ∈ [k]}, the output is indeed distributed according to p.
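As an illustration, the basic protocol and the referee rule just analyzed can be sketched as a small Monte Carlo experiment (our own code, not part of the paper's pseudocode; the function name is ours):

```python
import random

def algorithm1(p, rng):
    """One run of Algorithm 1 for ell = 1 with 2k players.

    Players 2i-1 and 2i each draw their own sample from p and send the bit
    1{sample == i}.  The referee outputs i (1-indexed) iff exactly one
    odd-numbered player's bit is 1 and its even partner's bit is 0;
    otherwise it aborts (None stands for the symbol ⊥)."""
    k = len(p)
    draw = lambda: rng.choices(range(1, k + 1), weights=p)[0]
    # bits[t] is the message of player t+1; players 2i-1, 2i are assigned symbol i.
    bits = [1 if draw() == t // 2 + 1 else 0 for t in range(2 * k)]
    odd = bits[0::2]                      # messages of players 1, 3, ..., 2k-1
    if sum(odd) == 1:
        i = odd.index(1)
        if bits[2 * i + 1] == 0:          # partner player 2(i+1) sent 0
            return i + 1
    return None

rng = random.Random(0)
p = [0.5, 0.25, 0.125, 0.125]             # satisfies ||p||_inf <= 1/2
outs = [algorithm1(p, rng) for _ in range(20000)]
ok = [x for x in outs if x is not None]
print(len(ok) / len(outs))                # approx prod_j (1 - p_j) ≈ 0.29
print(ok.count(1) / len(ok))              # conditional frequency of symbol 1 ≈ p_1 = 0.5
```

The empirical success rate matches ∏_j(1 − p_j) from (3), and the conditional output frequencies track p.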
Next, we bound the probability of abort for this protocol. By summing (3) over all i in [k], we obtain that the probability ρ_p := Pr[the referee does not output ⊥] is given by

ρ_p = ∏_{j=1}^k (1 − p_j).

Observe that while (as discussed above), conditioned on success, the output is from p, the probability of abort can depend on p. In particular, if there is one symbol with large probability (close to one), the success probability can be arbitrarily close to zero. This is where we use our assumption ‖p‖∞ ≤ 1/2 to establish that

ρ_p = ∏_{j=1}^k (1 − p_j) ≥ 1/4.

Indeed, the claimed bound follows from observing that 1 − x ≥ 4^{−x} for all x ∈ [0, 1/2], whence ρ_p ≥ 4^{−Σ_j p_j} = 1/4. Therefore, the probability of aborting is bounded above by 3/4, completing the proof.

To handle the case when ‖p‖∞ may exceed 1/2, we consider the distribution q on [2k] defined by

q_i = q_{k+i} = (1/2) · p_i,  i ∈ [k].

This distribution satisfies the condition ‖q‖∞ ≤ 1/2, and therefore the previous protocol yields a 3/4-simulation for it using 4k players observing independent samples from q. The problem now reduces to obtaining samples from q using samples from p, and then recovering a sample from p given a sample from q generated at the referee. Towards that, we note that although the players do not know p, given a sample from p, it is easy to convert it into a sample from q as follows: player j, upon observing X_j ∼ p, maps it to X_j or X_j + k with equal probability. We can use this process to convert the samples of 4k players into samples from q and apply Algorithm 1 to simulate a sample X̃ from q at the referee. Finally, we can convert the sample X̃ from q into a sample from p by declaring X̂ = ((X̃ − 1) mod k) + 1. Our enhancement of Algorithm 1 described next does exactly this, with a slight modification to avoid the use of additional randomness at the players (but instead using randomness at the referee only). This protocol achieves our desired performance for the case ℓ = 1. Theorem IV.7.
The protocol in Algorithm 2 uses 4k players and is a 3/4-simulation for p ∈ ∆_k. Moreover, the communication protocol used by the players is deterministic.

Proof. We first establish the following claim.
Claim IV.8.
The distribution of the flipped bits obtained after Line 2 coincides with that of the message bits when we execute Algorithm 1 using samples from q.
Algorithm 2
Distributed simulation protocol using ℓ = 1: The enhanced version
Require: n = 4k players observing one independent sample each from an unknown p
Players divide themselves into two sets of 2k players each, and each set executes a copy of Algorithm 1.
The referee receives the message bits (M₁, ..., M_{4k}) from all the players, and independently flips each message bit that is 1 to 0 with probability 1/2 to obtain (M′₁, ..., M′_{4k}).
if exactly one of the message bits M′₁, M′₃, ..., M′_{4k−1} is 1, say the message M′_{2i−1}, and the corresponding message M′_{2i} is 0, then
if i > k, then the referee updates i ← i − k; end if
the referee outputs X̂ = i;
else the referee outputs ⊥.
end if

To see this, note that, for i ∈ [k], each of the players assigned to symbol i (in either copy) sends the message 1 with probability p_i. Therefore, the flipped bits of these players will equal 1 with probability p_i/2 each, which is precisely q_i = q_{k+i}. But this is exactly the probability with which these messages would be 1 if the samples of the players were generated from q and we were executing Algorithm 1.

Next, note that the operation of the referee from here on can be described alternatively as obtaining X̃ by executing Algorithm 1 for 2 · 2k = 4k samples from q and declaring X̂ = ((X̃ − 1) mod k) + 1 if X̃ ≠ ⊥. Thus, the overall protocol behaves as if the players and the referee executed Algorithm 1 for samples from q and then the referee declared the output reduced modulo k, if it was not ⊥. As we saw above, this protocol constitutes a 3/4-simulation for p.

Moving now to the more general setting of arbitrary ℓ ∈ {1, ..., ⌈log k⌉}, we simply modify Algorithm 2 to use the extra bits of communication. For simplicity, we assume that 2^{ℓ−1} divides k and set m := k/2^{ℓ−1}. We partition the domain [k] into m equal contiguous parts S₁, ..., S_m, with |S_i| = 2^{ℓ−1}.
Our proposed modification of Algorithm 2, extending it to general ℓ ≥ 1, is given in Algorithm 3. The protocol can be developed incrementally in the same manner as the protocol for ℓ = 1: first, we obtain a protocol that works under an additional assumption on p using 2m players, and then we circumvent the requirement for that assumption by converting samples from p into samples from q, doubling the number of players. The form above is obtained in the same manner as that of Algorithm 2, by relegating the requirement for randomization at the players to the referee. The performance of this protocol is characterized in the theorem below.
Algorithm 3
Distributed simulation protocol using ℓ ≥ 1: Basic block
Require: n = 4m players observing one independent sample each from an unknown p
Players 2j − 1, 2j, 2(j + m) − 1, 2(j + m), for 1 ≤ j ≤ m, send the following communication depending on their observed sample x:
if x ∉ S_j, then send the all-zero sequence of length ℓ.
else indicate the precise value of x ∈ S_j using the remaining 2^ℓ − 1 binary sequences of length ℓ. We denote the sequence sent for i ∈ S_j by s_i ∈ {0, 1}^ℓ \ {0}.
end if
The referee independently changes each message M_j from player j that is not 0 to 0 with probability 1/2, to obtain the flipped message M′_j.
if exactly one of the message sequences M′₁, M′₃, ..., M′_{4m−1} is nonzero, say the message M′_{2j−1}, and the corresponding message sequence M′_{2j} is 0, then
if j > m, then the referee updates j ← j − m; end if
if M′_{2j−1} = s_i, the referee outputs X̂ = i ∈ S_j;
else the referee outputs X̂ = ⊥.
end if

Theorem IV.9. For any ℓ ≥ 1, Algorithm 3 uses ⌈k/2^{ℓ−3}⌉ players and is a 3/4-simulation for p ∈ ∆_k. Moreover, the communication protocol used by the players is deterministic.

Proof. The proof is similar to that of Theorem IV.7, with appropriate extensions to handle ℓ > 1. Note that the players in the set P_j := {2j − 1, 2j, 2(j + m) − 1, 2(j + m)}, j ∈ [m], use the same mapping to determine the message to send. Let i ∈ S_j. Then, for each player in the set P_j, the flipped message equals s_i (the sequence representing the symbol i) with probability p_i/2. It follows that the flipped message is 0 for any of these players with probability 1 − p(S_j)/2. Denoting by j_i the index j ∈ [m] such that i ∈ S_j, note that only players in P_{j_i} can declare s_i with positive probability.
Therefore, by combining the previous observations with the fact that the messages of all the players are independent, we get

Pr[ X̂ = i ] = 2 · (p_i/2) · ∏_{j ≠ j_i} (1 − p(S_j)/2)² · (1 − p(S_{j_i})/2)²,

where the first factor of 2 accounts for the two cases M′_{2j_i−1} = s_i and M′_{2(j_i+m)−1} = s_i, the product ∏_{j ≠ j_i} (1 − p(S_j)/2)² is the probability that each of the flipped messages M′_{2t−1} is 0 for t ∉ {j_i, j_i + m}, and the final factor (1 − p(S_{j_i})/2)² is the probability that the remaining odd flipped message within P_{j_i} and the even flipped message corresponding to the nonzero one are both 0. As a consequence, we get that

Pr[ X̂ ≠ ⊥ ] = ∏_{j ∈ [m]} (1 − p(S_j)/2)² ≥ 1/4,

where in the final bound we used once again the fact that 1 − x ≥ 4^{−x} for 0 ≤ x ≤ 1/2. This completes the proof.

Finally, we boost the probability of successful simulation from 1/4 to 1 − α by using multiple blocks. Algorithm 4
Distributed simulation protocol using ℓ ≥ 1: Complete protocol
Require: n = 40⌈log(1/α)⌉ · ⌈k/2^{ℓ−1}⌉ players observing one independent sample each from an unknown p
Divide the players into 10⌈log(1/α)⌉ disjoint groups of ⌈k/2^{ℓ−3}⌉ players each.
Execute Algorithm 3 on each block successively, one block at a time.
if not all blocks declare ⊥ as the output, then output X̂ = i, where i ∈ [k] is the output of the first block that does not output ⊥;
else output X̂ = ⊥ and terminate.
end if

We conclude with a proof establishing that Algorithm 4 attains the performance claimed in Theorem IV.5.
Proof of Theorem IV.5.
Each group in Algorithm 4 executes the 3/4-simulation protocol given in Algorithm 3, and the overall protocol outputs the symbol in [k] output by the first group to succeed, if such a group exists. This is a simple rejection sampling procedure and, clearly, conditioned on no abort, the distribution of the output is p. Furthermore, the algorithm declares ⊥ only if all the groups declare ⊥, which happens with probability at most (3/4)^{10⌈log(1/α)⌉} < α.

V. SIMULATE-AND-INFER
We now show how to use our distributed simulation results to design private-coin distributed inference protocols. The approach is natural: simulate enough independent samples at the referee R to solve the centralized problem. We first describe the implications of the results from Section IV for any distributed inference task, and then instantiate them for our two flagship applications: distribution learning and identity testing.

A. Private-coin ℓ-bit distributed inference via distributed simulation

Using the distributed simulation protocols of the previous section, we can simulate one sample from p at the referee using O(k/2^ℓ) players. Then, to solve an inference task in the distributed setting, the referee can simulate the number of samples needed to solve the task in the centralized setting. The resulting protocol will require a number of players roughly equal to the sample complexity of the inference problem when the samples are centralized, times O(k/2^ℓ), the number of players required to simulate each independent sample at the referee. We refer to protocols that first simulate samples from the underlying distribution and then use a centralized inference algorithm at the referee as simulate-and-infer protocols. For concreteness, we provide a formal description in Algorithm 5. For ~γ ∈ R^q_+, let ψ_P(~γ) denote the Algorithm 5
The simulate-and-infer protocol for P = ( C , X , E , l ) Require:
Parameters C, N, n = 4CN⌈k/2^{ℓ−1}⌉ players observing one sample each from an unknown p, and a (centralized) estimator e for P requiring N samples
Partition the players into CN blocks of size 4⌈k/2^{ℓ−1}⌉ each.
Execute an instance of the distributed simulation protocol given in Algorithm 3 on each block.
if at least N instances return (independent) samples X̂ ≠ ⊥, then take a subset (X̂₁, ..., X̂_N) of these samples and output ê = e(X̂₁, ..., X̂_N);
else output an arbitrary element ê ∈ E.
end if

sample complexity of a ~γ-inference protocol solving P in the centralized setting. That is, ψ_P(~γ) denotes the least n for which there exists an estimator e such that, for every p ∈ C and n independent samples from p, we have L_i(p, e) ≤ γ_i for all 1 ≤ i ≤ q, where L ∈ R^q_+ is defined in (1). The next result evaluates the performance of Algorithm 5. Theorem V.1.
Let P = (C, X, E, l) be an inference problem with bounded loss l : C × E → R^q; i.e., ‖l‖∞ ≤ 1. For 0 < δ < 1, 1 ≤ ℓ ≤ ⌈log k⌉, and ~γ ∈ R^q_+, upon setting N = ψ_P(~γ) and C = 8 + (8/ψ_P(~γ)) log(1/δ), the simulate-and-infer protocol given in Algorithm 5 requires O((ψ_P(~γ) ∨ log(1/δ)) · k/2^ℓ) players and constitutes an ℓ-bit deterministic (~γ + δ1_q)-inference protocol for P.

Proof. We denote the resulting distributed inference protocol by (π, e′) and proceed to show that it is a (~γ + δ1_q)-inference protocol for P. From Theorem IV.9, each block independently produces a sample with probability at least 1/4 (and ⊥ otherwise). Thus, by Hoeffding's inequality, the number of samples simulated is at least N = ψ_P(~γ) with probability at least 1 − δ as long as 2(C/4 − 1)² ψ_P(~γ)/C ≥ log(1/δ), which is satisfied for C ≥ 8 + (8/ψ_P(~γ)) log(1/δ). Denoting by E the event that the referee can simulate at least ψ_P(~γ) samples, the expected loss satisfies

L_i(p, e′) = Pr[E] · E[ l_i(p, ê) | E ] + Pr[Ē] · E[ l_i(p, ê) | Ē ] ≤ E[ l_i(p, ê) | E ] + δ ‖l_i‖∞ ≤ L_i(p, e) + δ ≤ γ_i + δ,

for every 1 ≤ i ≤ q, concluding the proof.

The theorem above is quite general and only requires that the loss function be bounded. Further, it is worth noting that the dependence on δ is very mild and can even be ignored, for instance, in settings where ~γ = γ1_q with γ ≍ δ and ψ_P(~γ) ≳ log(1/δ) (as the next two examples will illustrate).

B. Application: private-coin protocols from distributed simulation
As corollaries of Theorem V.1, we obtain distributed inference protocols for distribution learning and identity testing. Using the well-known result that Θ((k + log(1/δ))/ε²) samples are sufficient to learn a distribution over [k] to within a total variation distance ε with probability 1 − δ, we obtain the following. Corollary V.2.
For ℓ ∈ {1, ..., ⌈log k⌉}, simulate-and-infer constitutes an ℓ-bit deterministic (k, ε, δ)-learning protocol with O((k/(2^ℓ ε²)) (k + log(1/δ))) players. In particular, for any constant δ, O(k²/(2^ℓ ε²)) players suffice.

For identity testing, it is known that the sample complexity is O((√(k log(1/δ)) + log(1/δ))/ε²) (cf. [34], [20]). Thus, we get the following corollary to Theorem V.1. Corollary V.3.
For ℓ ∈ {1, ..., ⌈log k⌉}, simulate-and-infer constitutes an ℓ-bit deterministic (k, ε, δ)-identity testing protocol with O((k/(2^ℓ ε²)) (√(k log(1/δ)) + log(1/δ))) players. In particular, for any constant δ, O(k^{3/2}/(2^ℓ ε²)) players suffice.

Remark V.4. We highlight that, for constant δ, the two corollaries above are known to be optimal among all private-coin protocols. Indeed, up to constant factors, they achieve the sample complexity lower bounds established in [3] for private-coin learning and uniformity testing protocols, respectively. (As asides, we note that Theorem V.1 extends immediately to the more general bounded case ‖l‖∞ < ∞, instead of ‖l‖∞ ≤ 1; and that the log(1/δ) dependence of the centralized learning result can be shown, for instance, by considering the empirical distribution p̂ and using McDiarmid's inequality to bound the probability of the error event {d_TV(p, p̂) > ε}.) In particular, we remark that Corollary V.3 shows that simulate-and-infer attains the sample complexity
Θ(k^{3/2}/(2^ℓ ε²)) of identity testing using private-coin protocols.

C. Optimality of our distributed simulation protocol
Interestingly, a byproduct of our performance bound for simulate-and-infer protocols (more precisely, that of Corollary V.2) is that the α-simulation protocol from Theorem IV.9 uses the optimal number of players, up to constants. Corollary V.5.
For ℓ ∈ {1, ..., ⌈log k⌉} and α ∈ (0, 1), any ℓ-bit public-coin (possibly interactive) α-simulation protocol for k-ary distributions must have n = Ω(k/2^ℓ) players.

Proof. Let π be any ℓ-bit α-simulation protocol with n players. Proceeding analogously to the proofs of Theorem V.1 and Corollary V.2, we get that π can be used to obtain an ℓ-bit (k, ε, 1/3)-learning protocol with n′ = O(n · k/ε²) players. (Moreover, the resulting protocol is adaptive, private-, or public-coin, respectively, whenever π is.) However, as shown in [32] (see, also, [3]), any ℓ-bit public-coin (possibly interactive) (k, ε, 1/3)-learning protocol must have Ω(k²/(2^ℓ ε²)) players. It follows that n must satisfy n ≳ k/2^ℓ, as claimed.

VI. PUBLIC-COIN IDENTITY TESTING
In this section, we consider public-coin protocols for (k, ε)-identity testing and establish the following upper bound on the number of players required. Theorem VI.1.
For 1 ≤ ℓ ≤ ⌈log k⌉, there exists an ℓ-bit public-coin (k, ε)-identity testing protocol for n = O(k/(2^{ℓ/2} ε²)) players.

In view of Remark V.4, public-coin protocols require a factor √(k/2^ℓ) fewer samples than private-coin protocols for identity testing. This work is one of the first instances of a natural distributed inference problem where the availability of public coins changes the sample complexity. In fact, it follows from [3] that the sample requirement of O(k/(2^{ℓ/2} ε²)) in Theorem VI.1 is optimal for public-coin protocols. Thus, our work provides sample-optimal private- and public-coin protocols for identity testing.

We now present the public-coin protocol for distributed identity testing that attains the results above. The basic idea driving our scheme is simple: we find an ℓ-bit random mapping that preserves pairwise distances between the distributions of its inputs up to a fixed multiplicative factor, and apply an appropriate identity test to its output. We first identify these random mappings in the technical result below, which may be of independent interest.

Let (S₁, ..., S_L) be a random partition of the set [k] with each part of equal cardinality; that is, (S₁, ..., S_L) is distributed uniformly over the set of all partitions of [k] into L parts of equal cardinality k/L. For a distribution p ∈ ∆_[k], define random variables Z₁(p), ..., Z_L(p) as follows:

Z_r(p) := p(S_r),  r ∈ [L].  (4)

Note that each Z_r(p) is nonnegative and Σ_r Z_r(p) = 1. Thus, (Z₁(p), ..., Z_L(p)) is a random distribution over [L]. For two distributions p and q, if p = q, clearly the induced distributions (Z₁(p), ..., Z_L(p)) and (Z₁(q), ..., Z_L(q)) are identical. The next result shows that if p and q are far (in total variation distance), then the induced distributions, too, are far (in ℓ₂ distance). Theorem VI.2.
Fix any k-ary distributions p, q. For the (random) distributions p̄ = (Z₁(p), ..., Z_L(p)), q̄ = (Z₁(q), ..., Z_L(q)) over [L] defined in Eq. (4) above, the following holds: (i) if p = q, then p̄ = q̄ with probability one; and (ii) if d_TV(p, q) > ε, then

Pr[ ‖p̄ − q̄‖₂² > ε²/k ] ≥ c,

for some absolute constant c > 0.

The proof of this result involves proving the anticoncentration of Σ_{r∈[L]} (Σ_{j∈[k]} (p_j − q_j) 1{j ∈ S_r})². Since the random variables 1{j ∈ S_r} are dependent, the analysis becomes technical, requiring an analysis of the higher moments of the summation above and an application of the Paley–Zygmund inequality. The complete proof is deferred to Appendix B.

Our proposed test uses public randomness to sample the random partition (S₁, ..., S_L). In our application, we set L = 2^ℓ, whereby each player can describe the part in which its sample lies using ℓ bits. For convenience, we represent the partition (S₁, ..., S_L) using a length-k vector (Y₁, ..., Y_k), where for j = 1, ..., k we have Y_j ∈ [L] = {1, ..., L}. Each of the Y_j can be indicated using ℓ bits. A player observing sample i can send Y_i, represented by ℓ bits, to the referee. When each player applies this strategy, the referee accumulates n samples from (Z₁(p), ..., Z_L(p)), and it can apply a centralized test to them.

Before describing the general scheme formally, we illustrate it for the case ℓ = 1 and for uniformity testing. For this case, (Z₁(p), Z₂(p)) = (1/2, 1/2) when p = u_k and, by Theorem VI.2, a binary distribution with bias roughly ε/√k (with constant probability) when d_TV(p, u_k) > ε. Thus, the referee can simply test if the bits received are unbiased or have bias greater than ε/√k. As is well-known, it can do this using roughly (k/ε²) samples, and so (k/ε²) players suffice.
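The distance-preserving property of the random partition is easy to probe numerically. The following sketch (our own illustration, not the Appendix B proof) draws random equal-size partitions and checks how often the induced distributions witness the ε²/k gap of Theorem VI.2:

```python
import random

def random_partition(k, L, rng):
    """A uniformly random partition of {0,...,k-1} into L parts of size k/L."""
    perm = list(range(k))
    rng.shuffle(perm)
    size = k // L
    return [perm[r * size:(r + 1) * size] for r in range(L)]

def induced(dist, parts):
    """The induced distribution (Z_1,...,Z_L): Z_r is the mass of dist on S_r."""
    return [sum(dist[j] for j in part) for part in parts]

k, L = 64, 8
u = [1.0 / k] * k
p = [1.5 / k] * (k // 2) + [0.5 / k] * (k // 2)   # d_TV(p, u) = 0.25
eps = 0.25

rng = random.Random(1)
hits = 0
for _ in range(2000):
    parts = random_partition(k, L, rng)
    pbar, ubar = induced(p, parts), induced(u, parts)
    l2sq = sum((a - b) ** 2 for a, b in zip(pbar, ubar))
    hits += l2sq > eps ** 2 / k
print(hits / 2000)   # fraction of partitions with ||pbar - ubar||_2^2 > eps^2/k
```

For this particular pair the gap is witnessed by most partitions; the theorem only promises a constant fraction c.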
For testing uniformity with ℓ > 1, when p = u_k, the referee sees a uniform random variable with values in [2^ℓ]. However, when d_TV(p, u_k) > ε, we only know that the observed (2^ℓ)-ary random variable has a distribution that is away from uniform in ℓ₂ distance (with constant probability), and not in d_TV as above. We therefore need a centralized uniformity test with respect to ℓ₂ distance; such a test appeared in, for instance, [18, Proposition 3.1] or [17, Theorem 2.10]. In particular, we have a centralized test for testing whether an L-ary distribution is uniform or (γ/√L)-far from uniform in ℓ₂ using O(√L/γ²) samples.

For the general case when the reference distribution q can be different from uniform, our approach first involves reducing identity testing for q to uniformity testing. Towards this, we rely on the following result of Goldreich [27], which we state here for completeness. Lemma VI.3.
For any q ∈ ∆_k, there exists a randomized mapping F_q : ∆_k → ∆_{2k} satisfying the following properties: (i) F_q(q) = u_{2k}; (ii) for every p ∈ ∆_k such that d_TV(p, q) ≥ ε, it holds that d_TV(F_q(p), u_{2k}) ≥ ε/3; and (iii) there is an efficient algorithm for generating a sample from F_q(p) given one sample from p.

Remark VI.4. The mapping F_q and the algorithm mentioned in property (iii) above require the knowledge of q.

With this result at our disposal, each player can simply simulate samples from F_q(p) when they observe samples from p. Thereafter, we can simply apply the distributed uniformity test we outlined earlier, however for a slightly inflated domain of cardinality 2k.

The scheme we have outlined yields a test which, under p = q, accepts q with a constant probability and, when p is far from q, rejects it with a constant probability. It remains to "amplify" these constant probabilities to our desired probability of 2/3. In fact, the amplification technique we present, considered folklore in the computational learning community, allows us to easily amplify the probabilities to any arbitrary 1 − δ. We summarize this simple amplification in the next result. Lemma VI.5.
For θ₁ > 1 − θ₂, consider N independent samples generated from Bern(p) with either p ≥ θ₁ or p ≤ 1 − θ₂. Then, for N = O(log(1/δ)/(θ₁ + θ₂ − 1)²), we can find a test that accepts the hypothesis p ≥ θ₁ with probability greater than 1 − δ in the first case and rejects it with probability greater than 1 − δ in the second case.

The test is simply the empirical average compared with an appropriate threshold, and the proof follows from a standard Chernoff bound. We omit the details. As a corollary of Lemma VI.5 and Theorem VI.1, we obtain the following result.
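Before stating it, note that the test of Lemma VI.5 is a one-line threshold rule; the following sketch illustrates it (the function names, and the constant 8 inside the Chernoff-style sample size, are our own choices, not taken from the lemma):

```python
import math
import random

def amplify(bits, theta1, theta2):
    """Accept 'p >= theta1' iff the empirical average exceeds the midpoint
    between theta1 and 1 - theta2 (requires theta1 > 1 - theta2)."""
    return sum(bits) / len(bits) >= (theta1 + 1 - theta2) / 2

def num_samples(theta1, theta2, delta):
    # N = O(log(1/delta) / (theta1 + theta2 - 1)^2); 8 is our hedged constant.
    return math.ceil(8 * math.log(1 / delta) / (theta1 + theta2 - 1) ** 2)

theta1, theta2, delta = 0.6, 0.6, 0.05
N = num_samples(theta1, theta2, delta)
rng = random.Random(4)

# Case 1: true bias p = 0.65 >= theta1; case 2: p = 0.35 <= 1 - theta2.
accept = amplify([int(rng.random() < 0.65) for _ in range(N)], theta1, theta2)
reject = amplify([int(rng.random() < 0.35) for _ in range(N)], theta1, theta2)
print(accept, reject)
```

In the distributed test below, the `bits` are the accept/reject outputs of the independent per-block tests.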
Corollary VI.6.
For 1 ≤ ℓ ≤ ⌈log k⌉, there exists an ℓ-bit public-coin (k, ε, δ)-identity testing protocol for n = O((k/(2^{ℓ/2} ε²)) log(1/δ)) players.

Proof. Recall that, by our definition of (k, ε)-identity testing and Theorem VI.1, we are given a test with probability of correctness greater than 2/3. Thus, when p = q, the referee's output bit takes the value 1 with probability exceeding 2/3, and when d_TV(p, q) ≥ ε, the output bit takes the value 0 with probability exceeding 2/3. Therefore, the claimed test in the statement of the corollary is obtained by applying the test of Theorem VI.1 to O(log(1/δ)) blocks of O(k/(2^{ℓ/2} ε²)) players each, and applying the test in Lemma VI.5 to the binary outputs of these tests.

We summarize our overall distributed identity test in Algorithm 6 below.

Algorithm 6 An ℓ-bit public-coin protocol for distributed identity testing for reference distribution q.
Require:
Parameters γ ∈ (0, 1), N, and n players observing one sample each from an unknown p
1: Players use the algorithm in Lemma VI.3 to convert their samples from p to independent samples X̃_1, . . . , X̃_n from F_q(p) ∈ ∆_{5k}. ⊲ This step uses only private randomness.
2: Partition the players into N blocks of size m := n/N.
3: Players in each block use independent public coins to sample a random partition (S_1, . . . , S_L) of the domain with equal-sized parts. We represent this partition by (Y_1, . . . , Y_{5k}) with each Y_i ∈ [L], as mentioned above.
4: Upon observing the sample X̃_j = i in Line 1, player j sends Y_i (corresponding to its respective block), represented by ℓ bits.
5: For each block, the referee obtains n/N independent samples from (Z_1(p), . . . , Z_L(p)) and tests whether the underlying distribution is u_L or (γ/√L)-far from uniform in ℓ_2, with failure probability δ' := c/(2(1 + c)). ⊲ This uses the aforementioned test from [18], [17]; c > 0 is as in Theorem VI.2.
6: The referee applies the test from Lemma VI.5 to the N outputs of the independent tests (one for each block) and declares the output.

We now show that, with an appropriate choice of parameters, Algorithm 6 attains the performance promised in Theorem VI.1.

Proof of Theorem VI.1. Our proof rests on the two technical results noted above: Theorem VI.2 and Lemma VI.3. Consider the distributed identity test given in Algorithm 6. First, by Lemma VI.3, for any reference distribution q, the samples obtained by the players in Line 1 are independent samples from u_{5k} when p = q, and from a distribution that is at least (4ε/5)-far from u_{5k} in total variation distance when d_TV(p, q) > ε. The samples (X̃_1, . . . , X̃_n) are then "quantized" to ℓ bits in each block. For each block of m = n/N players, we can consider the samples seen by the referee as m independent samples from an unknown distribution on [L].
By the previous observation and Theorem VI.2, the common distribution of the independent samples at the referee in each block is either u_L (with probability 1) when p = q, or (ε/√(10k))-far from u_L in ℓ_2 distance with probability greater than c. We set γ := ε√L/√(10k) and apply the test from [18] or [17]. The test will succeed if the event in Theorem VI.2 occurs and the centralized uniformity test succeeds. By [18, Proposition 3.1] or [17, Theorem 2.10], this happens with probability greater than (1 − δ')c provided the number of samples m in each block exceeds

√L/γ^2 = 10k√L/(Lε^2) = 10k/(√L ε^2). (5)

We accordingly set the number of players in each block to m := ⌈10k/(2^{ℓ/2} ε^2)⌉. Note that the parameter δ' here is the chosen probability of failure of the centralized test; for our purpose, we shall see that it suffices to set δ' := c/(2(1 + c)).

Each block now provides a uniformity test which succeeds with probability exceeding 1 − δ' = (1 + c/2)/(1 + c). Finally, we amplify the probability of success by choosing the number of blocks N appropriately large, using the general amplification given in Lemma VI.5. Specifically, when p = q, the test for each block outputs 1 with probability greater than 1 − δ' = (1 + c/2)/(1 + c). On the other hand, when p is ε-far from q, the test for each block outputs 0 with probability greater than (1 − δ')c = (c + c^2/2)/(1 + c). Therefore, the claim follows upon applying Lemma VI.5 with θ_0 := (1 + c/2)/(1 + c) and θ_1 := (c + c^2/2)/(1 + c), which satisfy θ_0 > 1 − θ_1.

Note that the protocol in Algorithm 6 is remarkably simple and, moreover, is "smooth," in the sense that no player's output depends too much on any particular symbol from [k]. (Indeed, each player's output is the indicator of a set of 5k/2^ℓ elements, which for constant values of ℓ is Ω(k).)
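Concretely, the hashing step used by each block can be sketched as follows (uniformity case; function names are ours, with a seeded generator standing in for the public coins):

```python
# Sketch of the public-coin hashing in Algorithm 6: the players in a block
# share a random equipartition of the domain [k] into L = 2^l parts, and
# each player communicates only the l-bit part index of its sample.
import random

def random_equipartition(k, L, rng):
    """Return labels Y[0..k-1], each Y[i] in {0,...,L-1}, with every part
    of size exactly k/L."""
    assert k % L == 0
    labels = [r for r in range(L) for _ in range(k // L)]
    rng.shuffle(labels)
    return labels

def referee_tally(samples, Y, L):
    """Referee side: tally the l-bit messages Y[x]; these messages are
    i.i.d. draws from the hashed distribution (Z_1, ..., Z_L)."""
    counts = [0] * L
    for x in samples:
        counts[Y[x]] += 1
    return counts
```

The referee then runs a centralized ℓ_2 uniformity test on these tallies, one run per block.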
This "smoothness" can be a desirable feature when applying such protocols to a distribution whose domain originates from a quantization of a larger, or even continuous, domain, where the output of the test should not be too sensitive to the particular choice of quantization. Moreover, it is worth noting that the referee's knowledge of the shared randomness is not used in Algorithm 6.

Remark
VI.7 (Amount of shared randomness). It is easy to see that Algorithm 6 uses no more than O(ℓk) bits of shared randomness. Indeed, N = Θ(1) independent partitions of [k] into L := 2^ℓ equal-sized parts are chosen, and each such partition can be specified using O(log(L^k)) = O(k · ℓ) bits (the extra factor of 5 in the domain size, from Lemma VI.3, is absorbed in the O(·)). As mentioned in the preceding discussion, the proof of Theorem VI.1 hinges on Theorem VI.2, whose proof relies in turn on an anticoncentration argument involving only moments of order four or less of suitable random variables. As such, one could hope that using 4-wise independence (or a related notion) to sample the random equipartition of [k] may lead to drastic savings in the number of shared random bits required to implement the protocol.

This is indeed the case, with a caveat: namely, a straightforward way to implement Theorem VI.2 would be to require a 4-wise independent family of permutations of [k] (see, e.g., [36], [7]). Unfortunately, no non-trivial t-wise independent family of permutations is known to exist for t > 3 (although their existence is not ruled out). A way to circumvent this issue, and obtain a time- and randomness-efficient protocol using O(log k) shared random bits, is instead to observe that Theorem VI.2 still holds for a uniformly random partition (instead of equipartition) of [k] into L pieces. This is because its proof invokes Theorem A.6, which only requires suitable 4-symmetric random variables. An efficient implementation can then rely on a family of 4-wise independent random bits, for which explicit constructions with seed length O(log k) are known. However, this approach hits another stumbling block: when p = q, the resulting distribution (Z_1(q), . . . , Z_L(q)) on [L] need not be uniform (as the partition is no longer in equal-sized parts), and thus the sample complexity from (5) (which holds for uniformity testing in ℓ_2 distance) does not follow. We explain in Appendix C how to circumvent this difficulty and obtain a variant of Theorem VI.1 using only O(log k) shared random bits.

Remark
VI.8 (Instance-optimal testing). It may be of independent interest to consider instance-optimal identity testing in the sense of Valiant and Valiant [43], namely, to examine how the number of players needed depends on q itself rather than on the worst-case parameter k. Towards that, we describe in Appendix D an extension of Goldreich's reduction which makes it amenable to the instance-optimal setting, and which we believe will find further applications.

ACKNOWLEDGMENTS
The authors would like to thank the organizers of the 2018 Information Theory and Applications Workshop (ITA), where the collaboration leading to this work started.

APPENDIX
A. Impossibility of perfect simulation in the interior of the probability simplex
In this appendix, we establish Theorem IV.2, restated below. (Regarding the families of permutations discussed in Remark VI.7: given such a family F, one can obtain an equipartition of [k] into L pieces meeting our requirements by first fixing any equipartition Π of [k] into L pieces, then drawing a permutation σ ∈ F uniformly at random, using log|F| independent uniformly random bits, and applying σ to Π.)
Theorem A.1.
For any n ≥ 1, there does not exist any ℓ-bit perfect simulation of ternary distributions (k = 3) unless ℓ ≥ 2, even when the input distribution is known to come from an open set in the interior of the probability simplex.

Before we prove the theorem, we show that there is no loss of generality in restricting to deterministic protocols, namely, protocols in which each player communicates a deterministic function of their observation. The high-level argument is relatively simple: by replacing player j with two players 2j − 1 and 2j, each with a suitable deterministic strategy, the two 1-bit messages received by the referee will allow it to simulate player j's original randomized mapping. A similar derandomization was implicit in Algorithm 2.

Lemma A.2.
For X = {0, 1, 2}, suppose there exists a 1-bit perfect simulation S' = (π', δ') with n players. Then, we can find a 1-bit perfect deterministic simulation S = (π, δ) with 2n players such that, for each j ∈ [2n], the communication π_j sent by player j is a deterministic function of the sample x_j seen by player j, i.e., π_j(x, u) = π_j(x), x ∈ X.

Proof.
Consider any mapping f : {0, 1, 2} × {0, 1}* → {0, 1}. We will show that we can find mappings g_1 : {0, 1, 2} → {0, 1}, g_2 : {0, 1, 2} → {0, 1}, and h : {0, 1} × {0, 1} × {0, 1}* → {0, 1} such that, for every u,

Pr[f(X, u) = 1] = Pr[h(g_1(X_1), g_2(X_2), u) = 1], (6)

where the random variables X_1, X_2 take values in {0, 1, 2} and are independent and identically distributed, with the same distribution as X. We can then use this construction to get our claimed simulation S using 2n players as follows: replace the communication π'_j(x, u) from player j with communications π_{2j−1}(x_{2j−1}) and π_{2j}(x_{2j}), respectively, from two players 2j − 1 and 2j, where π_{2j−1} and π_{2j} correspond to the mappings g_1 and g_2 above for f = π'_j. The referee can then emulate the original protocol using the corresponding mapping h, with h(π_{2j−1}(x_{2j−1}), π_{2j}(x_{2j}), u) in place of the communication from player j in the original protocol. Since the probability distribution of the communication does not change, we retain the performance of S', but now using only deterministic communication.

Therefore, it suffices to establish (6). For convenience, denote α_u := 1{f(0, u) = 1}, β_u := 1{f(1, u) = 1}, and γ_u := 1{f(2, u) = 1}. Consider first the case when at most one of α_u, β_u, γ_u equals 1. In this case, we can assume without loss of generality that α_u ≤ β_u + γ_u, whereby (β_u + γ_u − α_u) ∈ {0, 1}. Let g_i(x) = 1{x = i} for i ∈ {1, 2}. Consider the mapping h given by

h(0, 0, u) = α_u, h(1, 0, u) = β_u, h(0, 1, u) = γ_u, h(1, 1, u) = β_u + γ_u − α_u.
Then, for every u,

Pr[h(g_1(X_1), g_2(X_2), u) = 1]
= α_u (1 − p_1)(1 − p_2) + β_u p_1 (1 − p_2) + γ_u (1 − p_1) p_2 + (β_u + γ_u − α_u) p_1 p_2
= α_u (1 − p_1 − p_2) + β_u p_1 + γ_u p_2
= Pr[f(X, u) = 1],

which completes the proof for this case. For the other case, we can simply consider (1 − α_u), (1 − β_u), and (1 − γ_u), and proceed as in the case above to preserve Pr[h(g_1(X_1), g_2(X_2), u) = 0].

We now prove Theorem IV.2; in view of our previous observation, we only need to consider deterministic communication.

Proof of Theorem IV.2.
Suppose, for the sake of contradiction, that there exists such a 1-bit deterministic perfect simulation protocol S = (π, δ) for n players on X = {0, 1, 2}, with π_j(x, u) = π_j(x) for all x. Assume that this protocol is correct for all distributions p in the neighborhood of some p* in the interior of the simplex. Consider a partition of the players into three sets S_0, S_1, and S_2, with

S_i := { j ∈ [n] : π_j(i) = 1 }, i ∈ {0, 1, 2}.

Note that for deterministic communication the message M is independent of the public randomness U. Then, by the definition of perfect simulation, it must be the case that

p_x = E_U[ Σ_{m ∈ {0,1}^n} δ_x(m, U) Pr[M = m | U] ] = E_U[ Σ_m δ_x(m, U) Pr[M = m] ] = Σ_m E_U[δ_x(m, U)] Pr[M = m] (7)

for every x ∈ X, which with our notation S_0, S_1, S_2 can be re-expressed as

p_x = Σ_{m ∈ {0,1}^n} E_U[δ_x(m, U)] Π_{i=0}^{2} Π_{j ∈ S_i} ( m_j p_i + (1 − m_j)(1 − p_i) )
    = Σ_{m ∈ {0,1}^n} E_U[δ_x(m, U)] Π_{i=0}^{2} Π_{j ∈ S_i} ( 1 − m_j + (2m_j − 1) p_i ),

for every x ∈ X. But since the right-hand side above is a polynomial in (p_0, p_1, p_2), its difference from p_x can only vanish on an open set in the interior if it vanishes identically. In particular, the constant term must be zero:

Σ_{m ∈ {0,1}^n} E_U[δ_x(m, U)] Π_{i=0}^{2} Π_{j ∈ S_i} (1 − m_j) = Σ_{m ∈ {0,1}^n} E_U[δ_x(m, U)] Π_{j=1}^{n} (1 − m_j) = 0.

Noting that every summand is non-negative, this implies that, for all x ∈ X and m ∈ {0,1}^n,

E_U[δ_x(m, U)] Π_{j=1}^{n} (1 − m_j) = 0.
In particular, for the all-zero message 0^n, we get E_U[δ_x(0^n, U)] = 0 for all x ∈ X, so that, again by non-negativity, we must have δ_x(0^n, u) = 0 for all x ∈ X and all realizations u of the randomness. But the message 0^n occurs with probability

Pr[M = 0^n] = Π_{i=0}^{2} Π_{j ∈ S_i} (1 − p_i) = (1 − p_0)^{|S_0|} (1 − p_1)^{|S_1|} (1 − p_2)^{|S_2|} > 0,

where the inequality holds since p lies in the interior of the simplex. Therefore, for the output X̂ of the referee, we have

Pr[X̂ ≠ ⊥] = Σ_m Σ_{x ∈ X} E_U[δ_x(m, U)] · Pr[M = m] = Σ_{m ≠ 0^n} Pr[M = m] Σ_{x ∈ X} E_U[δ_x(m, U)] ≤ Σ_{m ≠ 0^n} Pr[M = m] = 1 − Pr[M = 0^n] < 1,

contradicting the fact that S is a perfect simulation protocol.
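Returning briefly to Lemma A.2: the identity (6) underlying the two-player construction can be checked numerically. The sketch below (function names ours) verifies the case handled first in that proof:

```python
# Numerical check of (6) in Lemma A.2: for alpha <= beta + gamma with
# beta + gamma - alpha in {0, 1}, the two-player simulation with
# g_i(x) = 1{x = i} and the table h(0,0)=alpha, h(1,0)=beta, h(0,1)=gamma,
# h(1,1)=beta+gamma-alpha reproduces Pr[f(X, u) = 1] exactly.
import itertools

def pr_f(alpha, beta, gamma, p):
    """Pr[f(X, u) = 1] for a single sample X with distribution p = (p0, p1, p2)."""
    return alpha * p[0] + beta * p[1] + gamma * p[2]

def pr_h(alpha, beta, gamma, p):
    """Pr[h(g1(X1), g2(X2), u) = 1] for independent X1, X2 ~ p."""
    p1, p2 = p[1], p[2]
    return (alpha * (1 - p1) * (1 - p2) + beta * p1 * (1 - p2)
            + gamma * (1 - p1) * p2 + (beta + gamma - alpha) * p1 * p2)
```

Enumerating all admissible (α, β, γ) ∈ {0, 1}^3 confirms the two probabilities agree for any ternary p.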
A.3. It is unclear how to extend the proof of Theorem IV.2 to arbitrary k, ℓ. In particular, the proof of Lemma A.2 does not extend to the general case. A plausible proof strategy is a black-box application of the (k = 3, ℓ = 1) result to obtain the general result via a direct-sum-type argument.

B. Proof of Theorem VI.2
In this appendix, we prove Theorem VI.2, stating that a random balanced partition of the domain into L ≥ 2 parts preserves the ℓ_2 distance between distributions with constant probability. Note that the special case L = 2 was proven in the extended abstract [2], in a similar fashion. We begin by recalling the Paley–Zygmund inequality, a key tool we shall rely upon.

Theorem A.4 (Paley–Zygmund). Suppose U is a non-negative random variable with finite variance. Then, for every θ ∈ [0, 1],

Pr[
U > θ E[U] ] ≥ (1 − θ)^2 E[U]^2 / E[U^2].

We will prove a more general version of Theorem VI.2, showing that the ℓ_2 distance to any fixed distribution q ∈ ∆_[k] is preserved with constant probability under only mild assumptions on Y_1, . . . , Y_k; recall that we represent the partition (S_1, . . . , S_L) using a k-length vector (Y_1, . . . , Y_k) with each Y_i ∈ [L], such that Y_i = j ∈ [L] if i ∈ S_j. Namely, we only require that they be 4-symmetric:

Definition A.5.
Fix any t ∈ N. The random variables Y_1, . . . , Y_k over Ω are said to be t-symmetric if, for every i_1, i_2, . . . , i_t ∈ [k], every s ∈ N, and all f_1, . . . , f_s : Ω^t → R, the expectation E[ Π_{j=1}^{s} f_j(Y_{i_1}, . . . , Y_{i_t}) ] may only depend on the multiset {i_1, i_2, . . . , i_t} via its multiplicities. That is, for every permutation π : [k] → [k],

E[ Π_{j=1}^{s} f_j(Y_{i_1}, . . . , Y_{i_t}) ] = E[ Π_{j=1}^{s} f_j(Y_{π(i_1)}, . . . , Y_{π(i_t)}) ].

(For the application below, one should read the statement of Theorem A.6 with δ := p − q.)

Before stating the general statement we shall establish, we observe that random variables Y_1, . . . , Y_k as in Theorem VI.2 are indeed t-symmetric for any t ∈ [k]. Another prominent example of t-symmetric random variables is that of independent, or indeed t-wise independent, identically distributed r.v.'s (and, indeed, it is easy to see that t-symmetry for t ≥ 1 requires that the random variables be identically distributed). Moreover, for intuition, one can note that for Ω = {0, 1}, the definition amounts to asking that the expectation E[ Π_{s=1}^{t} Y_{i_s} ] depend only on the multiplicities of the multiset {i_1, i_2, . . . , i_t}.

Theorem A.6 (Probability Perturbation Hashing). Suppose 1 ≤ L < k is an integer dividing k, and fix any vector δ ∈ R^k such that Σ_{i ∈ [k]} δ_i = 0. Let the random variables Y_1, . . . , Y_k be 4-symmetric r.v.'s. Define Z = (Z_1, . . . , Z_L) ∈ R^L as

Z_r := Σ_{i=1}^{k} δ_i 1{Y_i = r}, r ∈ [L].

Then, for every α ∈ (0, 1/2],

Pr[ Pr[Y_1 ≠ Y_2] − 4√(2α) ≤ ||Z||_2^2 / ||δ||_2^2 ≤ α^{−1} Pr[Y_1 ≠ Y_2] ] ≥ α.
The gist of the proof is to consider a suitable non-negative random variable (namely, ||Z||_2^2) and bound its expectation and second moment in order to apply the Paley–Zygmund inequality, thereby obtaining anticoncentration around the mean. The difficulty, however, lies in the fact that bounding the moments of ||Z||_2^2 involves handling products of correlated [L]-valued random variables Y_i, which is technical even for the case L = 2 considered in [2]. For ease of presentation, we have divided the argument into smaller results.

In what follows, let the random variables Y_1, . . . , Y_k be as in the statement. Since they are 4-symmetric, expectations of the form E[f(Y_a, Y_b, Y_c, Y_d) g(Y_a, Y_b, Y_c, Y_d)] depend only on the number of times each distinct element appears in the multiset {a, b, c, d}. For ease of notation, we introduce the quantities below, for r_1, r_2, r_3, r_4 ∈ [L] (not necessarily distinct):

m_{r_1} := Pr[Y_1 = r_1],
m_{r_1,r_2} := Pr[Y_1 = r_1, Y_2 = r_2],
m_{r_1,r_2,r_3} := Pr[Y_1 = r_1, Y_2 = r_2, Y_3 = r_3],
m_{r_1,r_2,r_3,r_4} := Pr[Y_1 = r_1, Y_2 = r_2, Y_3 = r_3, Y_4 = r_4].

With this notation at our disposal, we are ready to proceed with the proof.
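As a sanity check before the formal lemmas, the first moment identity we are about to prove (Lemma A.8 below, E[||Z||_2^2] = Pr[Y_1 ≠ Y_2] · ||δ||_2^2) can be verified exactly for small parameters by enumerating every equipartition; the sketch and its names are ours:

```python
# Exact check of E[||Z||_2^2] = Pr[Y1 != Y2] * ||delta||_2^2 for a
# uniformly random equipartition of [k] into L parts (small k, L so that
# all equipartitions can be enumerated).
import itertools

def moments_over_equipartitions(delta, L):
    """Return (E[||Z||_2^2], Pr[Y1 != Y2]), both exact averages over all
    equipartitions of range(len(delta)) into L equal parts."""
    k = len(delta)
    labelings = [lab for lab in itertools.product(range(L), repeat=k)
                 if all(lab.count(r) == k // L for r in range(L))]
    exp_sq_norm = sum(
        sum(sum(d for d, y in zip(delta, lab) if y == r) ** 2 for r in range(L))
        for lab in labelings) / len(labelings)
    pr_diff = sum(lab[0] != lab[1] for lab in labelings) / len(labelings)
    return exp_sq_norm, pr_diff
```

For k = 4 and L = 2, say, the identity holds exactly, with Pr[Y_1 ≠ Y_2] = 2/3.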
Lemma A.7 (Each part has the right expectation) . For every r ∈ [ L ] , E [ Z r ] = 0 . Proof.
By linearity of expectation, for every r, E[Z_r] = Σ_{i=1}^{k} δ_i E[1{Y_i = r}] = m_r · Σ_{i=1}^{k} δ_i = 0.

Lemma A.8 (The ℓ_2 distance has the right expectation). For every r ∈ [L],

Var[Z_r] = E[Z_r^2] = (m_r − m_{r,r}) ||δ||_2^2.

In particular, the expected squared ℓ_2 norm of Z is

E[ ||Z||_2^2 ] = E[ Σ_{r=1}^{L} Z_r^2 ] = ( 1 − Σ_{r=1}^{L} m_{r,r} ) ||δ||_2^2 = Pr[Y_1 ≠ Y_2] · ||δ||_2^2.

Proof.
For a fixed r ∈ [L], using the definition of Z, the fact that Σ_{i=1}^{k} 1{Y_i = r} = k/L, and Lemma A.7, we get that

Var[Z_r] = E[Z_r^2] = E[ ( Σ_{i=1}^{k} δ_i 1{Y_i = r} )^2 ] = Σ_{1 ≤ i,j ≤ k} δ_i δ_j E[1{Y_i = r} 1{Y_j = r}]
= Σ_{i=1}^{k} δ_i^2 E[1{Y_i = r}] + 2 Σ_{1 ≤ i < j ≤ k} δ_i δ_j E[1{Y_i = r} 1{Y_j = r}]
= m_r ||δ||_2^2 + m_{r,r} · 2 Σ_{1 ≤ i < j ≤ k} δ_i δ_j = (m_r − m_{r,r}) ||δ||_2^2,

where the last step uses 2 Σ_{i < j} δ_i δ_j = (Σ_i δ_i)^2 − ||δ||_2^2 = −||δ||_2^2. Summing over r ∈ [L] yields the second claim, since Σ_r m_r = 1.

It follows from Markov's inequality that

Pr[ ||Z||_2^2 ≤ α^{−1} Pr[Y_1 ≠ Y_2] · ||δ||_2^2 ] ≥ 1 − α. (8)

For the lower tail bound, we will derive a bound on E[||Z||_2^4] and invoke, as discussed above, the Paley–Zygmund inequality. Note that the lower bound trivially holds whenever 4√(2α) ≥ Pr[Y_1 ≠ Y_2]; thus, we hereafter assume 4√(2α) < Pr[Y_1 ≠ Y_2]. We have:

Lemma A.9 (The ℓ_2 distance has the required second moment). There exists an absolute constant C > 0 such that E[ ||Z||_2^4 ] ≤ C ||δ||_2^4. Moreover, one can take C = 16.

Proof of Lemma A.9. Expanding the square, we have

E[ ||Z||_2^4 ] = E[ ( Σ_{r=1}^{L} Z_r^2 )^2 ] = Σ_{r=1}^{L} E[Z_r^4] + 2 Σ_{r < r'} E[Z_r^2 Z_{r'}^2].

Claim A.10. For every r ∈ [L], E[Z_r^4] ≤ 6 m_r ||δ||_2^4, and therefore Σ_{r=1}^{L} E[Z_r^4] ≤ 6 ||δ||_2^4.

Proof. We will mimic the proof of Lemma A.8. We first rewrite

E[Z_r^4] = E[ ( Σ_{i=1}^{k} δ_i 1{Y_i = r} )^4 ] = Σ_{1 ≤ a,b,c,d ≤ k} δ_a δ_b δ_c δ_d E[1{Y_a = r} 1{Y_b = r} 1{Y_c = r} 1{Y_d = r}].

Using symmetry once again, since every term E[1{Y_a = r} 1{Y_b = r} 1{Y_c = r} 1{Y_d = r}] depends only on the number of distinct elements in the multiset {a, b, c, d}, it will be equal to one of m_r, m_{r,r}, m_{r,r,r}, or m_{r,r,r,r}, and it suffices to keep track of the contribution of each of these four types of terms. From this, letting Σ_s := Σ_{|{a,b,c,d}| = s} δ_a δ_b δ_c δ_d for s ∈ [4], we get that

E[Z_r^4] = m_r Σ_1 + m_{r,r} Σ_2 + m_{r,r,r} Σ_3 + m_{r,r,r,r} Σ_4. (10)

We will rely on the following technical result.

Fact A.11. For Σ_1, Σ_2, Σ_3, and Σ_4 defined as above, we have

Σ_1 = ||δ||_4^4,
Σ_2 = 3 ||δ||_2^4 − 7 ||δ||_4^4,
Σ_3 = 12 ||δ||_4^4 − 6 ||δ||_2^4,
Σ_4 = −(Σ_1 + Σ_2 + Σ_3) = 3 ||δ||_2^4 − 6 ||δ||_4^4.
Proof of Fact A.11. We start by showing the last equality: "hiding zero," we get

0 = ( Σ_{i=1}^{k} δ_i )^4 = Σ_{1 ≤ a,b,c,d ≤ k} δ_a δ_b δ_c δ_d = Σ_1 + Σ_2 + Σ_3 + Σ_4,

and thus it is enough to establish the stated expressions for Σ_1, Σ_2, Σ_3. The first is a direct consequence of the definition: Σ_1 = Σ_{i=1}^{k} δ_i^4 = ||δ||_4^4. As for the second, we can derive it from

Σ_2 = Σ_{1 ≤ a,b,c,d ≤ k, |{a,b,c,d}| = 2} δ_a δ_b δ_c δ_d = 6 Σ_{i < j} δ_i^2 δ_j^2 + 4 Σ_{i ≠ j} δ_i^3 δ_j = 3 ( ||δ||_2^4 − ||δ||_4^4 ) − 4 ||δ||_4^4 = 3 ||δ||_2^4 − 7 ||δ||_4^4,

where we used Σ_{i ≠ j} δ_i^3 δ_j = Σ_i δ_i^3 (Σ_j δ_j) − ||δ||_4^4 = −||δ||_4^4. The expression for Σ_3 follows similarly, by enumerating the ordered quadruples with exactly three distinct elements.

Combining (10) with the above fact, we get

E[Z_r^4] = (m_r − 7 m_{r,r} + 12 m_{r,r,r} − 6 m_{r,r,r,r}) ||δ||_4^4 + 3 (m_{r,r} − 2 m_{r,r,r} + m_{r,r,r,r}) ||δ||_2^4
≤ (m_r + 5 m_{r,r,r}) ||δ||_4^4 + 3 (m_{r,r} − m_{r,r,r}) ||δ||_2^4
≤ (m_r + 3 m_{r,r} + 2 m_{r,r,r}) ||δ||_2^4
≤ 6 m_r ||δ||_2^4,

leveraging the inequalities ||δ||_4^4 ≤ ||δ||_2^4 and m_{r,r,r,r} ≤ m_{r,r,r} ≤ m_{r,r} ≤ m_r.

However, we need additional work to handle the second term of the expansion of E[||Z||_2^4], comprising roughly L^2 summands. In particular, to complete the proof, we show that each summand in the second term is at most a constant factor times m_{r,r'} ||δ||_2^4.

Claim A.12. We have Σ_{r < r'} E[Z_r^2 Z_{r'}^2] ≤ 5 ||δ||_2^4.

Proof. Fix any r ≠ r'. As before, we expand

E[Z_r^2 Z_{r'}^2] = E[ ( Σ_{i=1}^{k} δ_i 1{Y_i = r} )^2 ( Σ_{i=1}^{k} δ_i 1{Y_i = r'} )^2 ] = Σ_{1 ≤ a,b,c,d ≤ k} δ_a δ_b δ_c δ_d E[1{Y_a = r} 1{Y_b = r} 1{Y_c = r'} 1{Y_d = r'}].

We will use 4-symmetry once again to handle the terms E[1{Y_a = r} 1{Y_b = r} 1{Y_c = r'} 1{Y_d = r'}]. The key observation here is that if {a, b} ∩ {c, d} ≠ ∅, then 1{Y_a = r} 1{Y_b = r} 1{Y_c = r'} 1{Y_d = r'} = 0. This will be crucial, as it implies that the expected value can only be non-zero if |{a, b, c, d}| ≥ 2, yielding an m_{r,r'} dependence for the leading term in place of m_r.
E[Z_r^2 Z_{r'}^2] = Σ_{|{a,b,c,d}| = 2} δ_a^2 δ_b^2 E[1{Y_a = r} 1{Y_b = r'}]
+ Σ_{|{a,b,c,d}| = 3} δ_a^2 δ_b δ_c E[1{Y_a = r} 1{Y_b = r'} 1{Y_c = r'}]
+ Σ_{|{a,b,c,d}| = 3} δ_a δ_b δ_c^2 E[1{Y_a = r} 1{Y_b = r} 1{Y_c = r'}]
+ Σ_{|{a,b,c,d}| = 4} δ_a δ_b δ_c δ_d E[1{Y_a = r} 1{Y_b = r} 1{Y_c = r'} 1{Y_d = r'}]. (11)

The first term, which we will show dominates, can be expressed as

Σ_{a ≠ b} δ_a^2 δ_b^2 E[1{Y_a = r} 1{Y_b = r'}] = m_{r,r'} ( ||δ||_2^4 − ||δ||_4^4 ) ≤ m_{r,r'} ||δ||_2^4.

For the second and the third terms, noting that the full triple sum Σ_{1 ≤ a,b,c ≤ k} δ_a^2 δ_b δ_c = ||δ||_2^2 (Σ_a δ_a)^2 = 0, that Σ_{a ≠ b} δ_a^2 δ_b^2 ≤ ||δ||_2^4, and that |Σ_{a ≠ b} δ_a^3 δ_b| = ||δ||_4^4 ≤ ||δ||_2^4 (using "hiding zero" again), we get

−2 m_{r,r',r'} ||δ||_2^4 ≤ Σ_{|{a,b,c,d}| = 3} δ_a^2 δ_b δ_c E[1{Y_a = r} 1{Y_b = r'} 1{Y_c = r'}] ≤ 2 m_{r,r',r'} ||δ||_2^4,

and similarly for the third term, with m_{r,r,r'} in place of m_{r,r',r'}. Finally, similar manipulations yield

−3 m_{r,r,r',r'} ||δ||_2^4 ≤ Σ_{|{a,b,c,d}| = 4} δ_a δ_b δ_c δ_d E[1{Y_a = r} 1{Y_b = r} 1{Y_c = r'} 1{Y_d = r'}] ≤ 3 m_{r,r,r',r'} ||δ||_2^4.

Gathering all this in (11) and summing over the pairs r < r', we get

Σ_{r < r'} E[Z_r^2 Z_{r'}^2] ≤ Σ_{r < r'} ( m_{r,r'} + 2 m_{r,r',r'} + 2 m_{r,r,r'} + 3 m_{r,r,r',r'} ) ||δ||_2^4 ≤ 5 ||δ||_2^4,

since each of these probability families sums to at most 1 over the pairs r < r'. This proves the claim, and, combined with the bound on Σ_r E[Z_r^4] above, establishes Lemma A.9 with C = 6 + 2 · 5 = 16.

To conclude the proof of Theorem A.6, we apply the Paley–Zygmund inequality (Theorem A.4) to U := ||Z||_2^2 with θ := 1 − 4√(2α)/Pr[Y_1 ≠ Y_2] ∈ [0, 1]:

Pr[ ||Z||_2^2 > ( Pr[Y_1 ≠ Y_2] − 4√(2α) ) ||δ||_2^2 ] ≥ (1 − θ)^2 E[||Z||_2^2]^2 / E[||Z||_2^4] ≥ (32α / Pr[Y_1 ≠ Y_2]^2) · Pr[Y_1 ≠ Y_2]^2 ||δ||_2^4 / (16 ||δ||_2^4) = 2α.

Combined with (8) via a union bound, both bounds of Theorem A.6 hold simultaneously with probability at least 2α − α = α.

It remains to derive Theorem VI.2. Since the first item is immediate, it suffices to prove the second, which we do now. Recall that the random variables Y_1, . . . , Y_k from the statement of Theorem VI.2 are such that each Y_i is marginally uniform on [L], and Σ_{i=1}^{k} 1{Y_i = r} = k/L for every r ∈ [L]. In particular, Y_1, . . . , Y_k are 4-symmetric random variables, and we have

Pr[Y_1 ≠ Y_2] = 1 − Σ_{r=1}^{L} E[1{Y_1 = r} 1{Y_2 = r}] = 1 − L · (1/L) · (k − L)/(L(k − 1)) ≥ 1 − 1/L ≥ 1/2.
Further, a simple computation yields

E[1{Y_1 = r} 1{Y_2 = r}] = E[ E[ 1{Y_1 = r} 1{Y_2 = r} | 1{Y_1 = r} ] ] = (1/L) Pr[Y_2 = r | Y_1 = r] = (1/L) Pr[ Y_2 = r | Σ_{i=2}^{k} 1{Y_i = r} = k/L − 1 ] = (1/L) · (k − L)/(L(k − 1)),

where the final identity uses symmetry, along with the observation that

Σ_{i=2}^{k} E[ 1{Y_i = r} | Σ_{j=2}^{k} 1{Y_j = r} = k/L − 1 ] = k/L − 1.

Therefore, applying Theorem A.6 with α := 1/200 < Pr[Y_1 ≠ Y_2] and δ := p − q, we obtain

Pr[ ||Z||_2^2 ≥ (1/10) ||p − q||_2^2 ] ≥ α,

which yields the desired statement, since by the Cauchy–Schwarz inequality we have ||p − q||_2 > ε/√k whenever ℓ_1(p, q) > ε.

C. A randomness-efficient variant of Theorem VI.1

In this appendix, we describe how the protocol underlying Theorem VI.1, Algorithm 6, can be modified to reduce the number of shared random bits from the O(kℓ) required by Algorithm 6 to only O(log k).

Theorem A.13. For 1 ≤ ℓ ≤ ⌈log k⌉, there exists an ℓ-bit public-coin (k, ε)-identity testing protocol for n = O( k / (2^{ℓ/2} ε^2) ) players, using O(log k) public coins.

Proof. The corresponding protocol is provided in Algorithm 7, and it follows the same structure as Algorithm 6. As discussed in Remark VI.7, the two main differences are in Lines 3 and 8. In the former, we use a random 4-wise independent partition of [5k] into L parts, no longer necessarily equal-sized. This allows us to bring down the number of public coins to the stated bound, as guaranteed by the next fact, applied with t = 4:

Fact A.14. For any t ≥ 2 and k, ℓ ∈ N, there exists a t-wise independent probability space Ω ⊆ [2^ℓ]^k with uniform marginals and size |Ω| = 2^{t(ℓ + ⌈log k⌉)}. Moreover, one can efficiently sample from Ω given t, k, ℓ.

Proof. The proof relies on a standard construction of t-wise independent uniform [2^ℓ]-valued random variables via polynomials over an appropriate finite field.
Algorithm 7 A modified, randomness-efficient ℓ-bit public-coin protocol for distributed identity testing for reference distribution q.

Require: Parameters γ ∈ (0, 1), N, and n players observing one sample each from an unknown p
1: Players use the algorithm in Lemma VI.3 to convert their samples from p to independent samples X̃_1, . . . , X̃_n from F_q(p) ∈ ∆_{5k}. ⊲ This step uses only private randomness.
2: Partition the players into N blocks of size m := n/N.
3: Players in each block use 4(⌈log(5k)⌉ + ℓ) independent public coins to generate (using Fact A.14) 4-wise independent uniform r.v.'s Y_1, . . . , Y_{5k} ∈ [L], which they interpret as a random partition (S_1, . . . , S_L) of [5k] into L parts.
4: Upon observing the sample X̃_j = i in Line 1, player j sends Y_i (corresponding to its respective block), represented by ℓ bits.
5: for all blocks do
6: The referee obtains n/N independent samples from (Z_1(p), . . . , Z_L(p)).
7: Knowing the realization of the public coins, it computes the distribution q̃ ∈ ∆_L corresponding to (Z_1(q), . . . , Z_L(q)).
8: if ||q̃||_2 ≤ 2/√(cL) then it tests whether the underlying distribution is q̃ or (γ/√L)-far from q̃ in ℓ_2, with failure probability δ' ← c/(2 + c), where c is as in Theorem VI.2. ⊲ This uses the test from [18], stated in Theorem A.15.
9: else it draws a random Bern(1/2) bit and records it as the "output of the test" for this block.
10: end if
11: end for
12: The referee applies the test from Lemma VI.5 to the N outputs of the independent tests (one for each block) and declares the output.

Namely, fix a field F of size 2^{ℓ + ⌈log k⌉} and an equipartition F_1, . . . , F_{2^ℓ} of F (so that |F_1| = · · · = |F_{2^ℓ}| = 2^{⌈log k⌉}); it then suffices to sample uniformly at random a polynomial P ∈ F_{t−1}[X] (i.e., of degree less than t): evaluating it at k (fixed) points a_1, . . . , a_k ∈ F yields t-wise independent field elements, which correspond to elements Y_1, . . .
, Y_k ∈ [2^ℓ] (where Y_i = Σ_{j=1}^{2^ℓ} j · 1{P(a_i) ∈ F_j}) with the desired marginals.

In doing so, a new issue arises when applying the identity tester (in ℓ_2 distance) of Chan et al. [18] in Line 8. Note that we can no longer rely on a centralized uniformity testing algorithm (in ℓ_2 distance), as we did in Algorithm 6. This is because the resulting reference distribution defined by (Z_1(q), . . . , Z_L(q)) is no longer, in general, the uniform distribution u_L, but some distribution q̃ on [L]. Observe that this distribution q̃ is still fully known to the referee, who is aware of both q and the realization of the shared randomness (and therefore of Y_1, . . . , Y_k).

To handle this issue, we observe that the testing algorithm in ℓ_2 distance of Chan et al. does provide a guarantee beyond uniformity testing, for the general question of identity testing in ℓ_2 distance. It is, however, a guarantee which degrades with the ℓ_2 norm of the reference distribution (in our case, q̃).

Theorem A.15 ([18, Proposition 3.1], with the improvement of [22, Lemma II.3]). There exists an algorithm which, given a distance parameter γ > 0, L ∈ N, and β > 0, satisfies the following. Given n samples from each of two unknown distributions q, q' ∈ ∆_L such that β ≥ min(||q||_2, ||q'||_2), the algorithm distinguishes between the cases q = q' and ||q − q'||_2 > γ with probability at least 2/3, as long as n ≳ β/γ^2.

We note that the contribution of [22, Lemma II.3] is to explain how to replace the condition β ≥ max(||q||_2, ||q'||_2) from [18] by the weaker β ≥ min(||q||_2, ||q'||_2). Further, one can, as before, amplify the probability of success from 2/3 to any chosen constant, at the price of a constant factor in the sample complexity. We would like to apply this result to testing identity to the L-ary distribution q̃, with distance parameter γ/√L and parameter β := ||q̃||_2. The desired sample complexity would follow if we had ||q̃||_2 ≲ 1/√L, since then we would get ||q̃||_2 / (γ/√L)^2 ≲ √L/γ^2.
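(Returning briefly to Fact A.14: the polynomial construction can be sketched as follows. For simplicity, this sketch works over a prime field rather than the characteristic-2 field used above — an assumption of the sketch — but the t-wise independence argument is identical, and all names are ours.)

```python
# Sketch of Fact A.14: evaluating a uniformly random polynomial of degree
# < t over a finite field at fixed distinct points yields t-wise
# independent, uniformly distributed field elements.
import itertools, random

def eval_poly(coeffs, x, p):
    """Evaluate sum_j coeffs[j] * x**j modulo the prime p (Horner's rule)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def sample_twise(t, points, p, rng=random):
    """Draw random coefficients c_0, ..., c_{t-1} and return the evaluations
    at the given points; any t of these values are independent and uniform."""
    coeffs = [rng.randrange(p) for _ in range(t)]
    return [eval_poly(coeffs, a, p) for a in points]
```

Grouping the field into 2^ℓ equal buckets, as in the proof above, then turns each evaluation into a uniform ℓ-bit label Y_i.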
Of course, we cannot argue that ||q̃||_2 ≲ 1/√L with probability one over the choice of the random partition. However, since F_q(q) = u_{5k}, it is a simple exercise to check that, over this choice,

E[ ||q̃||_2^2 ] = 1/(5k) + (5k − 1)/(5kL) ≤ 2/L.

Therefore, letting c ∈ (0, 1) be the constant from Theorem VI.2, we get by Markov's inequality that ||q̃||_2 ≤ 2/√(cL) with probability at least 1 − c/2. Since we ran, in Line 8, the identity test with probability of failure δ' := c/(2 + c), we have the following. When p = q, each block outputs 1 with probability at least

θ_0 := (1 − c/2)(1 − δ') + c/4 = (c^2 − 2c + 8) / (4(c + 2)),

while, when p is ε-far from q, the test for each block outputs 0 with probability greater than

θ_1 := (1 − δ')c = 2c/(c + 2),

so that we have indeed θ_0 > 1 − θ_1. (Recall that, in contrast to the protocol here, the referee's knowledge of the shared randomness was not used in Algorithm 6.) We then conclude the proof as that of Theorem VI.1, amplifying the probabilities of success by invoking Lemma VI.5 and choosing a suitable N = Θ(1). The total number of public coins used is then at most N · 4(⌈log(5k)⌉ + ℓ) = O(log k), as claimed.

D. From uniformity to parameterized identity testing

In this appendix, we explain how the existence of a distributed protocol for uniformity testing implies the existence of one for identity testing with roughly the same parameters, and further even implies one for identity testing in the massively parameterized sense ("instance-optimal" in the vocabulary of Valiant and Valiant, who introduced it [43]). These two results will be seen as a straightforward consequence of [27], which establishes the former reduction in the standard non-distributed setting, and of [12], which implies that massively parameterized identity testing reduces to "worst-case" identity testing. Specifically, we show the following:

Proposition A.16.
Suppose that there exists an ℓ-bit (k, ε, δ)-uniformity testing protocol π for n(k, ℓ, ε, δ) players. Then there exists an ℓ-bit (k, ε, δ)-identity testing protocol π' against any fixed reference distribution q (known to all players), for n(5k, ℓ, ε, δ) players.

Furthermore, this reduction preserves the setting of randomness (i.e., private-coin protocols are mapped to private-coin protocols).

Proof. We rely on the result of Goldreich [27], which describes a mapping F_q : ∆_[k] → ∆_[5k] such that F_q(q) = u_[5k], and d_TV(F_q(p), u_[5k]) > 4ε/5 for any p ∈ ∆_[k] that is ε-far from q. In more detail, this mapping proceeds in two stages: the first allows one to assume, at essentially no cost, that the reference distribution q is "grained," i.e., such that all probabilities q(i) are a multiple of 1/m for some m ≲ k. Then, the second mapping transforms a given m-grained distribution to the uniform distribution on an alphabet of slightly larger cardinality. The resulting F_q is the composition of these two mappings. Moreover, a crucial property of F_q is that, given the knowledge of q, a sample from F_q(p) can be efficiently simulated from a sample from p; this implies the proposition.

Massively parameterized setting, a terminology borrowed from property testing, refers here to the fact that the sample complexity depends not on a single parameter k, but on the k-ary distribution q itself.

In [27], Goldreich exhibits a randomized mapping that converts the problem of testing identity over a domain of size k with proximity parameter ε to that of testing uniformity over a domain of size k' := k/α with proximity parameter ε' := (1 − α)ε, for every fixed choice of α ∈ (0, 1). This mapping further preserves the success probability of the tester. Since the resulting uniformity testing problem has sample complexity Θ(√(k')/ε'^2), the blowup factor 1/(√α (1 − α)^2) is minimized by α = 1/5.

Remark A.17.
The result above crucially assumes that every player has explicit knowledge of the reference distribution q to be tested against, as this knowledge is necessary for them to simulate a sample from F_q(p) given their sample from the unknown p. If only the referee R is assumed to know q, then the above reduction does not go through.

The previous reduction enables a distributed test for any identity testing problem using at most, roughly, as many players as required for distributed uniformity testing. However, we can expect to use fewer players for specific reference distributions. Indeed, in the standard, non-distributed setting, Valiant and Valiant [43] carried out a refined analysis, termed instance optimal, and showed that the sample complexity of testing identity to q is captured, roughly, by the 2/3-quasinorm of a sub-vector of q obtained as follows. Assume without loss of generality that q₁ ≥ q₂ ≥ ··· ≥ q_k ≥ 0, let t ∈ [k] be the largest integer such that Σ_{i=t+1}^{k} q_i ≥ ε, and let q_ε := (q₂, …, q_t) (i.e., q with its largest element and its ε-"tail" removed). The main result of [43] shows that the sample complexity of testing identity to q is upper and lower bounded (up to constants) by max{ ‖q_{ε/16}‖_{2/3}/ε², 1/ε } and max{ ‖q_ε‖_{2/3}/ε², 1/ε }, respectively. However, it is not clear whether the aforementioned reduction of Goldreich from identity to uniformity testing preserves this parameterization of the sample complexity. In particular, the 2/3-quasinorm characterization does not seem to be amenable to the same type of analysis as that underlying Proposition A.16.
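The truncated 2/3-quasinorm above is easy to compute explicitly. The sketch below is our own illustration (the function name `vv_functional` and the NumPy implementation are ours, not from [43]); for the uniform distribution it recovers the familiar √k scaling.

```python
import numpy as np

def vv_functional(q, eps):
    """Return ||q_eps||_{2/3}: sort q in non-increasing order, find the
    largest t such that q_{t+1} + ... + q_k >= eps, and compute the
    2/3-quasinorm (sum_i q_i^{2/3})^{3/2} of (q_2, ..., q_t)."""
    q = np.sort(np.asarray(q, dtype=float))[::-1]
    suffix = np.cumsum(q[::-1])[::-1]   # suffix[j] = q[j] + ... + q[k-1]
    # 0-indexed position t coincides with the largest 1-indexed t whose
    # tail mass q_{t+1} + ... + q_k is at least eps
    t = int(np.nonzero(suffix >= eps)[0].max())
    q_eps = q[1:t]                      # drop the largest entry and the tail
    return float(np.sum(q_eps ** (2.0 / 3)) ** 1.5)

# For the uniform distribution on [k], ||q_eps||_{2/3} is on the order of
# sqrt(k), recovering the sqrt(k)/eps^2 bound for uniformity testing.
k = 10_000
print(vv_functional(np.full(k, 1.0 / k), 0.1))   # ~85.4, of order sqrt(k) = 100
```

For a distribution concentrated on a single element the functional vanishes, matching the intuition that testing identity to a point mass is easy.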
Interestingly, a different instance-optimal characterization, due to Blais, Canonne, and Gur [12], does admit such a reduction, enabling us to obtain the analogue of Proposition A.16 for this massively parameterized setting.

To state the result as parameterized by q (instead of k), we will need the definition of a new functional, Φ(q, γ); see [12, Section 6] for a discussion of the basic properties of Φ and of how it relates to notions such as the sparsity of p and the functional ‖p^{−max}_{−γ}‖_{2/3} defined in [43]. For a ∈ ℓ₁(ℕ) and t ∈ (0, ∞), let

 κ_a(t) := inf_{a′+a″=a} ( ‖a′‖₁ + t‖a″‖₂ )

and, for q ∈ Δ_ℕ and any γ ∈ (0, 1), let Φ(q, γ) := κ_q^{−1}(1 − γ)². It was observed in [12] that if q is supported on at most k elements, then Φ(q, γ) ≤ k for all γ ∈ (0, 1). Moreover, the sample complexity of testing identity to q was shown there to be upper and lower bounded (again, up to constants) by max{ Φ(q, ε/9)^{1/2}/ε², 1/ε } and Φ(q, ε)^{1/2}/ε², respectively. We are now in a position to state our general reduction.

Proposition A.18. Suppose that there exists an ℓ-bit (k, ε, δ)-uniformity testing protocol π for n(k, ℓ, ε, δ) players. Then there exists an ℓ-bit (k, ε, δ)-identity testing protocol π′ for any fixed reference distribution q (known to all players), for n(5(Φ(q, ε/9) + 1), ℓ, ε/2, δ) players. Further, this reduction preserves the setting of randomness (i.e., private-coin protocols are mapped to private-coin protocols).

Proof. This strengthening of Proposition A.16 stems from the algorithm for identity testing given in [12], which at a high level reduces testing identity to q of an (unknown) distribution p to testing identity of p|_{S_q(ε)} to q|_{S_q(ε)}, where S_q(ε) is the (ε/3)-effective support of q, along with checking that p puts probability mass at most roughly ε/3 outside of S_q(ε). The key result of [12] relates this effective support to the functional Φ defined above.
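Before continuing with the proof, the functional Φ just defined can be estimated numerically. The sketch below is ours (not code from [12]): for nonnegative q it searches the decompositions q″ = min(q, λ), q′ = max(q − λ, 0), a soft-thresholding family that in any case yields an upper bound on κ_q(t), and then inverts the nondecreasing map t ↦ κ_q(t) by bisection.

```python
import numpy as np

def kappa(q, t, grid=2001):
    """Estimate kappa_q(t) = inf_{q' + q'' = q} ||q'||_1 + t*||q''||_2 by
    scanning the soft-thresholding splits q'' = min(q, lam), q' = max(q - lam, 0)."""
    lams = np.linspace(0.0, q.max(), grid)[:, None]
    ell1 = np.maximum(q - lams, 0.0).sum(axis=1)            # ||q'||_1
    ell2 = np.linalg.norm(np.minimum(q, lams), axis=1)      # ||q''||_2
    return float((ell1 + t * ell2).min())

def phi(q, gamma, iters=40):
    """Estimate Phi(q, gamma) = kappa_q^{-1}(1 - gamma)^2 by bisection,
    using that t -> kappa_q(t) is nondecreasing and equals 1 at t = sqrt(k)."""
    lo, hi = 0.0, 2.0 * np.sqrt(len(q))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kappa(q, mid) >= 1.0 - gamma:
            hi = mid
        else:
            lo = mid
    return hi ** 2

# Sanity check: for the uniform distribution on [k], kappa(t) = min(1, t/sqrt(k)),
# so Phi(u_k, gamma) = (1 - gamma)^2 * k <= k.
u = np.full(100, 0.01)
print(phi(u, 0.19))   # ~65.61 = 0.81^2 * 100
```

For the uniform distribution the estimate matches (1 − γ)²k, consistent with the observation above that Φ(q, γ) ≤ k for q on a support of size k.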
They show (see [12, Section 7.2]) that for all q ∈ Δ_k and ε ∈ (0, 1),

 |S_q(ε)| ≤ Φ(q, ε/9).  (12)

See Fig. 2 for an illustration.

Fig. 2. The reference distribution q (in blue; assumed non-increasing without loss of generality) and the unknown distribution p (in red), plotted as q(i), p(i) against i ∈ [k]; the effective support S_q(ε) and the tail of mass ε are marked.

By the reduction above, testing equality of p to q is tantamount to (i) determining S_q(ε), which depends only on q; (ii) testing identity of the conditional distributions of p and q on S_q(ε); and (iii) testing that p assigns at most O(ε) probability to the complement of S_q(ε). (Recall that the ε-effective support of a distribution q is a minimal set of elements accounting for at least 1 − ε of the probability mass of q.) The protocol π′ then works as follows:

1) Given their knowledge of q and ε, all players (and the referee) compute S := S_q(ε). Consider the mapping G_q : Δ_[k] → Δ_{S∪{⊥}} defined, for any p′ ∈ Δ_[k], by

 G_q(p′)(x) = p′(x) if x ∈ S, and G_q(p′)(⊥) = p′([k] \ S).

Note that all players have full knowledge of q̃ := G_q(q). Further, each player, given their sample from the (unknown) p, can straightforwardly obtain a sample from p̃ := G_q(p).
2) All players (and the referee) compute k′ := 5(|S| + 1) and the mapping F_q̃ : Δ_{S∪{⊥}} → Δ_[k′] (as in the proof of Proposition A.16). From the properties of F_q̃ described in the proof of Proposition A.16, F_q̃(q̃) = u_{k′}.
3) Each player converts their sample from the (unknown) distribution p̃ into a sample from the (unknown) distribution F_q̃(p̃). (Recall that this is possible given the knowledge of q̃, as stated in the proof of Proposition A.16.)
4) The players and the referee execute the purported ℓ-bit uniformity testing protocol π on their samples from F_q̃(p̃), with parameters (k′, ε/2, δ).
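The collapse map G_q of Step 1, and the total-variation inequality d_TV(p̃, q̃) ≥ d_TV(p, q) − min(p(S̄), q(S̄)) underlying the analysis, can be checked numerically. This is an illustrative sketch (helper names such as `effective_support` and `collapse` are ours):

```python
import numpy as np

def effective_support(q, eps):
    """Smallest set of highest-probability elements of q with mass >= 1 - eps."""
    order = np.argsort(q)[::-1]
    m = int(np.searchsorted(np.cumsum(q[order]), 1.0 - eps)) + 1
    return set(order[:m].tolist())

def collapse(p, S):
    """G_q(p): keep the coordinates in S, fold the rest into one symbol (bottom)."""
    inside = np.array(sorted(S))
    return np.append(p[inside], 1.0 - p[inside].sum())

def tv(a, b):
    return 0.5 * np.abs(a - b).sum()

# Check tv(collapsed) >= tv(p, q) - min(p(Sbar), q(Sbar)) on random instances.
rng = np.random.default_rng(0)
for _ in range(200):
    p, q = rng.dirichlet(np.ones(40)), rng.dirichlet(np.ones(40))
    S = effective_support(q, 0.1)
    out = np.array([i for i in range(40) if i not in S])
    slack = min(p[out].sum(), q[out].sum()) if len(out) else 0.0
    assert tv(collapse(p, S), collapse(q, S)) >= tv(p, q) - slack - 1e-12
print("inequality verified on 200 random instances")
```

The inequality in fact holds for an arbitrary set S, which is what the random check above exercises; the proof below only needs it for S = S_q(ε), where q(S̄) is additionally small.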
The output of π′ is then the output of π. If p = q, then p̃ = q̃ and thus F_q̃(p̃) = F_q̃(q̃) = u_{k′}, so that the protocol π returns 1 with probability at least 1 − δ. On the other hand, if d_TV(p, q) > ε, then

 2 d_TV(p̃, q̃) = Σ_{x∈S} |p(x) − q(x)| + |p(S̄) − q(S̄)|
  = 2 d_TV(p, q) − Σ_{x∈S̄} |p(x) − q(x)| + |p(S̄) − q(S̄)|
  ≥ 2 d_TV(p, q) − (p(S̄) + q(S̄)) + |p(S̄) − q(S̄)|
  = 2 d_TV(p, q) − 2 min(p(S̄), q(S̄))
  > 2ε − 2·(ε/3) = 4ε/3,

where the last step uses min(p(S̄), q(S̄)) ≤ q(S̄) ≤ ε/3, since S is the (ε/3)-effective support of q; i.e., d_TV(p̃, q̃) > 2ε/3. Recalling the guarantee of Goldreich's reduction (as described in the proof of Proposition A.16), this in turn implies that d_TV(F_q̃(p̃), u_{k′}) ≥ (4/5)·(2ε/3) > ε/2, and therefore the protocol π must return 0 with probability at least 1 − δ. To conclude, in view of (12), the number of players required by π′ is

 n(k′, ℓ, ε/2, δ) = n(5(|S_q(ε)| + 1), ℓ, ε/2, δ) ≤ n(5(Φ(q, ε/9) + 1), ℓ, ε/2, δ),

as claimed.

References

[1] J. Acharya, C. L. Canonne, C. Freitag, and H. Tyagi, “Inference under information constraints III: Local privacy constraints,” 2019, in preparation. Preprint available at arXiv:abs/1808.02174.
[2] ——, “Test without trust: Optimal locally private distribution testing,” in Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS’19), 2019, to appear. Full version available on arXiv (abs/1808.02174).
[3] J. Acharya, C. L. Canonne, and H. Tyagi, “Inference under information constraints I: Lower bounds from chi-square contraction,” 2018, in submission. Preprint available at arXiv:abs/1812.11476.
[4] J. Acharya, C. Daskalakis, and G. C. Kamath, “Optimal Testing for Properties of Distributions,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R. Garnett, and R.
Garnett, Eds. Curran Associates, Inc., 2015, pp. 3577–3598.
[5] J. Acharya, Z. Sun, and H. Zhang, “Hadamard response: Estimating distributions privately, efficiently, and with little communication,” arXiv, vol. abs/1802.04705, 2018.
[6] R. Ahlswede and I. Csiszár, “Hypothesis testing with communication constraints,” IEEE Transactions on Information Theory, vol. 32, no. 4, pp. 533–542, July 1986.
[7] N. Alon and S. Lovett, “Almost k-wise vs. k-wise independent permutations, and uniformity for general group actions,” Theory of Computing, vol. 9, pp. 559–577, 2013.
[8] S. Balakrishnan and L. Wasserman, “Hypothesis testing for high-dimensional multinomials: A selective review,” The Annals of Applied Statistics, vol. 12, no. 2, pp. 727–749, 2018. [Online]. Available: https://doi.org/10.1214/18-AOAS1155SF
[9] M. Balcan, A. Blum, S. Fine, and Y. Mansour, “Distributed learning, communication complexity and privacy,” in Proceedings of the 25th Conference on Learning Theory (COLT’12), ser. JMLR Proceedings, vol. 23. JMLR.org, 2012, pp. 26.1–26.22.
[10] T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White, “Testing random variables for independence and identity,” in Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2001, pp. 442–451.
[11] M. Bavarian, B. Ghazi, E. Haramaty, P. Kamath, R. L. Rivest, and M. Sudan, “The optimality of correlated sampling,” arXiv, vol. abs/1612.01041, 2016.
[12] E. Blais, C. L. Canonne, and T. Gur, “Distribution testing lower bounds via reductions from communication complexity,” in Computational Complexity Conference, ser. LIPIcs, vol. 79. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2017, pp. 28:1–28:40.
[13] S. P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[14] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P.
Woodruff, “Communication lower bounds for statistical estimation problems via a distributed data processing inequality,” in Proceedings of the 48th Annual ACM Symposium on Theory of Computing (STOC’16). ACM, 2016, pp. 1011–1020.
[15] A. Z. Broder, “On the resemblance and containment of documents,” in Compression and Complexity of Sequences 1997, Proceedings. IEEE, 1997, pp. 21–29.
[16] C. L. Canonne, “A Survey on Distribution Testing: Your data is Big. But is it Blue?” Electronic Colloquium on Computational Complexity (ECCC), vol. 22, p. 63, Apr. 2015.
[17] C. L. Canonne, I. Diakonikolas, T. Gouleakis, and R. Rubinfeld, “Testing shape restrictions of discrete distributions,” Theory of Computing Systems, pp. 1–59, 2017. [Online]. Available: http://dx.doi.org/10.1007/s00224-017-9785-6
[18] S. Chan, I. Diakonikolas, G. Valiant, and P. Valiant, “Optimal algorithms for testing closeness of discrete distributions,” in Proceedings of SODA, 2014, pp. 1193–1203.
[19] A. De, E. Mossel, and J. Neeman, “Non interactive simulation of correlated distributions is decidable,” in Proceedings of SODA. SIAM, 2018, pp. 2728–2746.
[20] I. Diakonikolas, T. Gouleakis, J. Peebles, and E. Price, “Sample-optimal identity testing with high probability,” Electronic Colloquium on Computational Complexity (ECCC), vol. 24, p. 133, 2017.
[21] I. Diakonikolas, E. Grigorescu, J. Li, A. Natarajan, K. Onak, and L. Schmidt, “Communication-efficient distributed learning of discrete distributions,” in Advances in Neural Information Processing Systems 30, 2017, pp. 6394–6404.
[22] I. Diakonikolas and D. M. Kane, “A new approach for testing properties of discrete distributions,” in Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, 2016.
[23] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Local privacy and statistical minimax rates,” in Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, 2013, pp. 429–438.
[24] P. Gács and J.
Körner, “Common information is far less than mutual information,” Problems of Control and Information Theory, vol. 2, no. 2, pp. 149–162, 1973.
[25] A. Garg, T. Ma, and H. L. Nguyen, “On communication cost of distributed statistical estimation and dimensionality,” in Advances in Neural Information Processing Systems 27, 2014, pp. 2726–2734.
[26] B. Ghazi, P. Kamath, and M. Sudan, “Decidability of non-interactive simulation of joint distributions,” in Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, 2016, pp. 545–554.
[27] O. Goldreich, “The uniform distribution is complete with respect to testing identity to a fixed distribution,” Electronic Colloquium on Computational Complexity (ECCC), vol. 23, p. 15, 2016. [Online]. Available: http://eccc.hpi-web.de/report/2016/015
[28] O. Goldreich and D. Ron, “On testing expansion in bounded-degree graphs,” Electronic Colloquium on Computational Complexity (ECCC), Tech. Rep. TR00-020, 2000.
[29] T. S. Han, “Hypothesis testing with multiterminal data compression,” IEEE Transactions on Information Theory, vol. 33, no. 6, pp. 759–772, November 1987.
[30] T. S. Han and S.-I. Amari, “Statistical inference under multiterminal data compression,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2300–2324, October 1998.
[31] Y. Han, P. Mukherjee, A. Özgür, and T. Weissman, “Distributed statistical estimation of high-dimensional and nonparametric distributions with communication constraints,” Feb. 2018, talk given at ITA 2018. [Online]. Available: http://ita.ucsd.edu/workshop/18/files/abstract/abstract_2352.txt
[32] Y. Han, A. Özgür, and T. Weissman, “Geometric lower bounds for distributed parameter estimation under communication constraints,” in Proceedings of the 31st Conference on Learning Theory (COLT’18), ser. Proceedings of Machine Learning Research, vol. 75. PMLR, 2018, pp. 3163–3188.
[33] T.
Holenstein, “Parallel repetition: simplifications and the no-signaling case,” in Proceedings of the 39th Annual ACM Symposium on Theory of Computing. ACM, 2007, pp. 411–419.
[34] D. Huang and S. Meyn, “Generalized error exponents for small sample universal hypothesis testing,” IEEE Transactions on Information Theory, vol. 59, no. 12, pp. 8157–8181, 2013.
[35] S. Kamath and V. Anantharam, “Non-interactive simulation of joint distributions: The Hirschfeld-Gebelein-Rényi maximal correlation and the hypercontractivity ribbon,” in Proceedings of the 50th Annual Allerton Conference on Communication, Control, and Computing. IEEE, 2012, pp. 1057–1064.
[36] E. Kaplan, M. Naor, and O. Reingold, “Derandomized constructions of k-wise (almost) independent permutations,” Algorithmica, vol. 55, no. 1, pp. 113–133, 2009.
[37] J. Kleinberg and E. Tardos, “Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields,” Journal of the ACM (JACM), vol. 49, no. 5, pp. 616–639, 2002.
[38] E. Kushilevitz and N. Nisan, Communication Complexity. New York, NY, USA: Cambridge University Press, 1997.
[39] L. Paninski, “A coincidence-based test for uniformity given very sparsely sampled discrete data,” IEEE Transactions on Information Theory, vol. 54, no. 10, pp. 4750–4755, 2008.
[40] R. Rubinfeld, “Taming big probability distributions,” XRDS: Crossroads, The ACM Magazine for Students, vol. 19, no. 1, p. 24, Sep. 2012. [Online]. Available: http://dx.doi.org/10.1145/2331042.2331052
[41] O. Shamir, “Fundamental limits of online and distributed algorithms for statistical learning and estimation,” in Advances in Neural Information Processing Systems 27, 2014, pp. 163–171.
[42] G. Valiant and P. Valiant, “An automatic inequality prover and instance optimal identity testing,” in Proceedings of the 55th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2014.
[43] ——, “An automatic inequality prover and instance optimal identity testing,” SIAM Journal on Computing, vol. 46, no. 1, pp. 429–455, 2017.
[44] ——, “An automatic inequality prover and instance optimal identity testing,” SIAM Journal on Computing, vol. 46, no. 1, pp. 429–455, 2017, journal version of [42].
[45] T. Watson, “Communication complexity of statistical distance,” TOCT, vol. 10, no. 1, pp. 2:1–2:11, 2018.
[46] A. Wyner, “The common information of two dependent random variables,” IEEE Transactions on Information Theory, vol. 21, no. 2, pp. 163–179, 1975.
[47] A. Xu and M. Raginsky, “Information-theoretic lower bounds on Bayes risk in decentralized estimation,” IEEE Transactions on Information Theory, vol. 63, no. 3, pp. 1580–1600, 2017.
[48] Y. Zhang, J. Duchi, M. I. Jordan, and M. J. Wainwright, “Information-theoretic lower bounds for distributed statistical estimation with communication constraints,” in Advances in Neural Information Processing Systems 26, 2013, pp. 2328–2336.