Fundamental Limits of Online and Distributed Algorithms for Statistical Learning and Estimation
Ohad Shamir
Weizmann Institute of Science
[email protected]
Abstract
Many machine learning approaches are characterized by information constraints on how they interact with the training data. These include memory and sequential access constraints (e.g. fast first-order methods to solve stochastic optimization problems); communication constraints (e.g. distributed learning); partial access to the underlying data (e.g. missing features and multi-armed bandits); and more. However, we currently have little understanding of how such information constraints fundamentally affect our performance, independent of the learning problem semantics. For example, are there learning problems where any algorithm which has a small memory footprint (or can use any bounded number of bits from each example, or has certain communication constraints) will perform worse than what is possible without such constraints? In this paper, we describe how a single set of results implies positive answers to the above, for several different settings.
Information constraints play a key role in machine learning. Of course, the main constraint is the availability of only a finite data set, from which the learner is expected to generalize. However, many problems currently researched in machine learning can be characterized as learning with additional information constraints, arising from the manner in which the learner may interact with the data. Some examples include:
• Communication constraints in distributed learning:
There has been much work in recent years on learning when the training data is distributed among several machines (with [14, 2, 28, 47, 25, 31, 9, 17, 38] being just a few examples). Since the machines may work in parallel, this potentially allows significant computational speed-ups and the ability to cope with large datasets. On the flip side, communication rates between machines are typically much slower than their processing speeds, and a major challenge is to perform these learning tasks with minimal communication.
• Memory constraints:
The standard implementation of many common learning tasks requires memory which is super-linear in the data dimension. For example, principal component analysis (PCA) requires us to estimate eigenvectors of the data covariance matrix, whose size is quadratic in the data dimension and can be prohibitive for high-dimensional data. Another example is kernel learning, which requires manipulation of the Gram matrix, whose size is quadratic in the number of data points. There has been considerable effort in developing and analyzing algorithms for such problems with reduced memory footprint (e.g. [43, 5, 10, 54, 49]).
• Online learning constraints:
The need for fast and scalable learning algorithms has popularised the use of online algorithms, which work by sequentially going over the training data and incrementally updating a (usually small) state vector. Well-known special cases include gradient descent and mirror descent algorithms (see e.g. [51, 52]). The requirement of sequentially passing over the data can be seen as a type of information constraint, whereas the small state these algorithms often maintain can be seen as another type of memory constraint.
• Partial-information constraints:
A common situation in machine learning is when the available data is corrupted, sanitized (e.g. due to privacy constraints), has missing features, or is otherwise partially accessible. There has also been considerable interest in online learning with partial information, where the learner only gets partial feedback on its performance. This has been used to model various problems in web advertising, routing and multiclass learning. Perhaps the most well-known case is the multi-armed bandits problem [18, 8, 7], with many other variants being developed, such as contextual bandits [39, 41], combinatorial bandits [21], and more general models such as partial monitoring [18, 13].

Although these examples come from very different domains, they all share the common feature of information constraints on how the learning algorithm can interact with the training data. In some specific cases (most notably, multi-armed bandits, and also in the context of certain distributed protocols, e.g. [9, 56]) we can even formalize the price we pay for these constraints, in terms of degraded sample complexity or regret guarantees. However, we currently lack a general information-theoretic framework, which directly quantifies how such constraints can impact performance. For example, are there cases where any online algorithm, which goes over the data one-by-one, must have a worse sample complexity than (say) empirical risk minimization? Are there situations where a small memory footprint provably degrades the learning performance? Can one quantify how a constraint of getting only a few bits from each example affects our ability to learn? To the best of our knowledge, there are currently no generic tools which allow us to answer such questions, at least in the context of standard machine learning settings.

In this paper, we make a first step in developing such a framework. We consider a general class of learning processes, characterized only by information-theoretic constraints on how they may interact with the data (and independent of any specific problem semantics). As special cases, these include online algorithms with memory constraints, certain types of distributed algorithms, as well as online learning with partial information. We identify cases where any such algorithm must perform worse than what can be attained without such information constraints. The tools developed allow us to establish several results for specific learning problems:
• We prove a new and generic regret lower bound for partial-information online learning with expert advice. The lower bound is Ω(√((d/b)T)), where T is the number of rounds, d is the dimension of the loss/reward vector, and b is the number of bits extracted from each loss vector. It is optimal up to log-factors (without further assumptions), and holds no matter what these b bits are – a single coordinate (as in multi-armed bandits), some information on several coordinates (studied in various settings including semi-bandit feedback, bandits with side observations, and prediction with limited advice), a linear projection (as in bandit linear optimization), some feedback signal from a restricted set (as in partial monitoring), etc. Interestingly, it holds even if the online learner is allowed to adaptively choose which bits of the loss vector it can retain at each round. The lower bound quantifies in a very direct way how information constraints in online learning degrade the attainable regret, independent of the problem semantics.
• We prove that for some learning and estimation problems - in particular, sparse PCA and sparse covariance estimation in R^d - no online algorithm can attain statistically optimal performance (in terms of sample complexity) with less than Ω̃(d²) memory. To the best of our knowledge, this is the first formal example of a memory/sample complexity trade-off in a statistical learning setting.
• We show that for similar types of problems, there are cases where no distributed algorithm (which is based on a non-interactive or serial protocol on i.i.d. data) can attain optimal performance with less than Ω̃(d²) communication per machine. To the best of our knowledge, this is the first formal example of a communication/sample complexity trade-off, in the regime where the communication budget is larger than the data dimension, and the examples at each machine come from the same underlying distribution.
• We demonstrate the existence of simple (toy) stochastic optimization problems where any algorithm which uses memory linear in the dimension (e.g. stochastic gradient descent or mirror descent) cannot be statistically optimal.
Related Work
In stochastic optimization, there has been much work on lower bounds for sequential algorithms, starting from the seminal work of [45], and including more recent works such as [1]. [48] also consider such lower bounds from a more general information-theoretic perspective. However, these results all hold in an oracle model, where data is assumed to be made available in a specific form (such as a stochastic gradient estimate). As already pointed out in [45], this does not directly translate to the more common setting, where we are given a dataset and wish to run a simple sequential optimization procedure. Indeed, recent works exploited this gap to get improved algorithms using more sophisticated oracles, such as the availability of prox-mappings [46]. Moreover, we are not aware of cases where these lower bounds indicate a gap between the attainable performance of any sequential algorithm and batch learning methods (such as empirical risk minimization).

In the context of distributed learning and statistical estimation, information-theoretic lower bounds have been recently shown in the pioneering work [56]. Assuming communication budget constraints on different machines, the paper identifies cases where these constraints affect statistical performance. Our results (in the context of distributed learning) are very similar in spirit, but there are two important differences. First, their results pertain to parametric estimation in R^d, where the communication budget per machine is much smaller than what is needed to even specify the answer (O(d) bits). In contrast, our results pertain to simpler detection problems, where the answer requires only O(log(d)) bits, yet lead to non-trivial lower bounds even when the budget size is much larger (in some cases, much larger even than d). The second difference is that their work focuses on distributed algorithms, while we address a more general class of algorithms, which includes other information-constrained settings. Strong lower bounds in the context of distributed learning have also been shown in [9], but they do not apply to a regime where examples across machines come from the same distribution, and where the communication budget is much larger than what is needed to specify the output.

There are well-known lower bounds for multi-armed bandit problems and other online learning settings with partial information. However, they crucially depend on the semantics of the information feedback considered. For example, the standard multi-armed bandit lower bound [8] pertains to a setting where we can view a single coordinate of the loss vector, but doesn't apply as-is when we can view more than one coordinate (as in semi-bandit feedback [35, 6], bandits with side observations [42], or prediction with limited advice [50]), receive a linear projection (as in bandit linear optimization), or receive a different type of partial feedback (such as in partial monitoring [20]). In contrast, our results are generic and can directly apply to any such setting.

The inherent limitations of streaming and distributed algorithms, including memory and communication constraints, have been extensively studied within theoretical computer science (e.g. [4, 11, 23, 44, 12]). Unfortunately, almost all these results consider tasks unrelated to learning, and/or adversarially generated data, and thus do not apply to statistical learning tasks, where the data is assumed to be drawn i.i.d. from some underlying distribution. [55, 27] do consider i.i.d.
data, but focus on problems such as detecting graph connectivity and counting distinct elements, and not learning problems such as those considered here. On the flip side, there are works on memory-efficient algorithms with formal guarantees for statistical problems (e.g. [43, 10, 34, 24]), but these do not consider lower bounds or provable trade-offs.

Finally, there has been a line of works on hypothesis testing and statistical estimation with finite memory (see [36, 40, 32, 37] and references therein). However, the limitations shown in these works apply when the required precision exceeds the amount of memory available, a regime which is usually relevant only when the data size is exponential in the memory size. (For example, suppose we have B bits of memory and try to estimate the mean of a random variable in [0, 1]. If we have 2^{2B} data points, then by Hoeffding's inequality we can estimate the mean up to accuracy O(2^{-B}), but the finite memory already limits us to an accuracy of 2^{-B}; the memory thus becomes the bottleneck only once the data size is exponential in the memory size.) In contrast, we do not rely on finite precision considerations.

We begin with a few words about notation. We use bold-face letters (e.g. x) to denote vectors, and let e_j ∈ R^d denote the j-th standard basis vector. When convenient, we use the standard asymptotic notation O(·), Ω(·), Θ(·) to hide constants, and an additional ˜ sign (e.g. Õ(·)) to also hide log-factors. log(·) refers to the natural logarithm, and log₂(·) to the base-2 logarithm.

Our main object of study is the following generic class of information-constrained algorithms:

Definition 1 ((b, n, m) Protocol). Given access to a sequence of mn i.i.d. instances (vectors in R^d), an algorithm is a (b, n, m) protocol if it has the following form, for some functions f_t returning an output of at most b bits, and some function f:
• For t = 1, ..., m:
  – Let X_t be a batch of n i.i.d. instances.
  – Compute message W_t = f_t(X_t, W_1, W_2, ..., W_{t−1}).
• Return W = f(W_1, ..., W_m).

Note that the functions {f_t}_{t=1}^m, f are completely arbitrary, may depend on m, and can also be randomized. The crucial assumption is that the outputs W_t are constrained to be only b bits.

At this stage, the definition above may appear quite abstract, so let us consider a few specific examples:
• b-memory online protocols: Consider any algorithm which goes over examples one-by-one, and incrementally updates a state vector W_t of bounded size b. We note that a majority of online learning and stochastic optimization algorithms have bounded memory. For example, for linear predictors, most gradient-based algorithms maintain a state whose size is proportional to the size of the parameter vector that is being optimized. Such algorithms correspond to (b, n, m) protocols where n = 1; W_t is the state vector after round t, with an update function f_t depending only on X_t and W_{t−1}, and f depends only on W_m.
• Non-interactive and serial distributed algorithms:
There are m machines and each machine receives an independent sample X_t of size n. It then sends a message W_t = f_t(X_t) (which here depends only on X_t). A centralized server then combines the messages to compute an output f(W_1, ..., W_m). This includes, for instance, divide-and-conquer style algorithms proposed for distributed stochastic optimization (e.g. [57, 58]). A serial variant of the above is when there are m machines, and one-by-one, each machine t broadcasts some information W_t to the other machines, which depends on X_t as well as the previous messages sent by machines 1, 2, ..., (t−1).
• Online learning with partial information:
This is a special case of (b, 1, m) protocols. We sequentially receive d-dimensional loss vectors, and from each of these we can extract and use only b bits of information, where b ≪ d. For example, this includes most types of multi-armed bandit problems.
• Mini-batch online learning algorithms:
The data is streamed one-by-one or in mini-batches of size n, with mn instances overall. An algorithm sequentially updates its state based on a b-dimensional vector extracted from each example/batch (such as a gradient or gradient average), and returns a final result after all the data is processed. This includes most gradient-based algorithms we are aware of, but also distributed versions of these algorithms (such as parallelizing a mini-batch processing step as in [29, 25]).

We note that our results can be generalized to allow the size of the messages W_t to vary across t, and even to be chosen in a data-dependent manner. In our work, we contrast the performance attainable by any algorithm corresponding to such protocols, to constraint-free protocols which are allowed to interact with the sampled instances in any manner.

Our results are based on a simple 'hide-and-seek' statistical estimation problem, for which we show a strong gap between the attainable performance of information-constrained protocols and constraint-free protocols. It is parameterized by a dimension d, bias ρ, and sample size mn, and defined as follows:

Definition 2 (Hide-and-seek Problem). Consider the set of product distributions {Pr_j(·)}_{j=1}^d over {−1, +1}^d defined via E_{x∼Pr_j(·)}[x_i] = 2ρ·1{i = j} for all coordinates i = 1, ..., d. Given an i.i.d. sample of mn instances generated from Pr_j(·), where j is unknown, detect j.

In words, Pr_j(·) corresponds to picking all coordinates other than j to be ±1 uniformly at random, and independently picking coordinate j to be +1 with a higher probability (1/2 + ρ). The goal is to detect the biased coordinate j based on a sample. First, we note that without information constraints, it is easy to detect the biased coordinate with O(log(d)/ρ²) instances. This is formalized in the following theorem, which is an immediate consequence of Hoeffding's inequality and a union bound:

Theorem 1.
Consider the hide-and-seek problem defined earlier. Given mn samples, if J̃ is the coordinate with the highest empirical average, then
$$\Pr_j(\tilde{J} = j) \;\ge\; 1 - d\,\exp\!\left(-\frac{mn\rho^2}{2}\right).$$
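To make the setup concrete, here is a minimal sketch (ours, not from the paper; names and constants are illustrative) of the hide-and-seek distribution of Definition 2 and of the constraint-free plug-in detector of Thm. 1, which simply returns the coordinate with the highest empirical average:

```python
import numpy as np

def sample_hide_and_seek(num, d, j, rho, rng):
    # Pr_j from Definition 2: every coordinate is +/-1 uniformly at random,
    # except coordinate j, which equals +1 with probability 1/2 + rho
    # (so E[x_j] = 2*rho, while E[x_i] = 0 for all i != j).
    x = rng.choice([-1.0, 1.0], size=(num, d))
    x[:, j] = np.where(rng.random(num) < 0.5 + rho, 1.0, -1.0)
    return x

def plugin_detector(sample):
    # Constraint-free detector from Thm. 1: the coordinate with the highest
    # empirical average; succeeds w.h.p. once the sample size is ~ log(d)/rho^2.
    return int(np.argmax(sample.mean(axis=0)))

rng = np.random.default_rng(0)
d, rho, j_true = 500, 0.1, 17
mn = int(10 * np.log(d) / rho ** 2)        # O(log(d)/rho^2) instances
sample = sample_hide_and_seek(mn, d, j_true, rho, rng)
print(plugin_detector(sample) == j_true)   # True with high probability
```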
We now show that for this hide-and-seek problem, there is a large regime where detecting j is information-theoretically possible (by Thm. 1), but any information-constrained protocol will fail to do so with high probability. We first show this for (b, 1, m) protocols (i.e. protocols which process one instance at a time, such as bounded-memory online algorithms, and distributed algorithms where each machine holds a single instance):

Theorem 2.
Consider the hide-and-seek problem on d > 1 coordinates, with some bias ρ ≤ 1/4 and sample size m. Then for any estimate J̃ of the biased coordinate returned by any (b, 1, m) protocol, there exists some coordinate j such that
$$\Pr_j(\tilde{J} = j) \;\le\; \frac{3}{d} + 21\sqrt{\frac{m\rho^2 b}{d}}.$$

The theorem implies that any algorithm corresponding to (b, 1, m) protocols requires a sample size m ≥ Ω((d/b)/ρ²) to reliably detect some j. When b is polynomially smaller than d (e.g. a constant), we get an exponential gap compared to constraint-free protocols, which only require O(log(d)/ρ²) instances. Moreover, Thm. 2 is optimal up to log-factors: Consider a b-memory online algorithm which splits the d coordinates into O(d/b) segments of O(b) coordinates each, and sequentially goes over the segments, each time using Õ(1/ρ²) independent instances to determine whether one of the coordinates in the current segment is biased by ρ (assuming ρ is not exponentially small in b, this can be done with O(b) memory by maintaining the empirical average of each coordinate in the segment). This allows us to detect the biased coordinate using Õ((d/b)/ρ²) instances.
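The following sketch is our illustration of the segment-scanning strategy just described (not code from the paper): it keeps at most b running sums at a time, spends Õ(1/ρ²) fresh instances on each segment, and therefore uses Õ((d/b)/ρ²) instances overall. The callable `draw_instance` is an assumption standing for any source of fresh i.i.d. instances in {−1,+1}^d, e.g. one row produced by the sampler sketched after Thm. 1.

```python
import numpy as np

def segment_scan_detector(d, b, rho, draw_instance, delta=0.01):
    # b-memory online protocol matching Thm. 2 up to log-factors: scan the
    # coordinates in ceil(d/b) segments of at most b coordinates, keeping
    # running sums only for the current segment (at most b counters).
    n_per_segment = int(np.ceil(8 * np.log(2 * d / delta) / rho ** 2))
    best_coord, best_mean = 0, -np.inf        # O(1) extra state across segments
    for start in range(0, d, b):
        width = min(b, d - start)
        sums = np.zeros(width)                # the only per-segment memory
        for _ in range(n_per_segment):
            x = draw_instance()               # one fresh instance in {-1,+1}^d
            sums += x[start:start + width]
        k = int(np.argmax(sums))
        if sums[k] / n_per_segment > best_mean:
            best_coord, best_mean = start + k, sums[k] / n_per_segment
    # Total sample size: ceil(d/b) * n_per_segment = O((d/b) * log(d)/rho^2).
    return best_coord
```

For instance, with the earlier sampler one could pass `draw_instance=lambda: sample_hide_and_seek(1, d, j_true, rho, rng)[0]`.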
We now turn to provide an analogous result for general (b, n, m) protocols (where n is possibly greater than 1). However, it is a bit weaker in terms of the dependence on the bias parameter (the proof of Thm. 2 can be applied in the case n > 1, but the dependence on n is exponential; see the proof for details):

Theorem 3. Consider the hide-and-seek problem on d > 1 coordinates, with some bias ρ ≤ 1/(4n) and sample size mn. Then for any estimate J̃ of the biased coordinate returned by any (b, n, m) protocol, there exists some coordinate j such that
$$\Pr_j(\tilde{J} = j) \;\le\; \frac{3}{d} + 5\sqrt{mn\,\min\left\{\frac{\rho b}{d},\, \rho^2\right\}}.$$

The theorem implies that any (b, n, m) protocol will require a sample size mn which is at least Ω(max{(d/b)/ρ, 1/ρ²}) in order to detect the biased coordinate. This is larger than the O(log(d)/ρ²) instances required by constraint-free protocols whenever ρ > b log(d)/d, and establishes a trade-off between sample complexity and information complexities such as memory and communication in this regime.

The proofs of our theorems appear in Appendix A. However, the technical details may obfuscate the high-level intuition, which we now turn to explain. From an information-theoretic viewpoint, our results are based on analyzing the mutual information between j and W_t in a graphical model as illustrated in Figure 1. In this model, the unknown message j (i.e. the identity of the biased coordinate) is correlated with one of d independent binary-valued random vectors (one for each coordinate across the data instances X_t). All these random vectors are noisy, and the mutual information in bits between X_{t,j} and j can be shown to be on the order of nρ². Without information constraints, it follows that given m instantiations of X_t, the total amount of information conveyed on j by the data is Θ(mnρ²), and if this quantity is larger than log(d), then there is enough information to uniquely identify j. Note that no stronger bound can be established with standard statistical lower-bound techniques, since these do not consider information constraints internal to the algorithm used.

[Figure 1: Illustration of the relationship between j, the coordinates 1, 2, ..., j, ..., d of the sample X_t, and the message W_t. The coordinates are independent of each other, and most of them just output ±1 uniformly at random. Only X_{t,j} has a slightly different distribution and hence contains some information on j.]

Indeed, in our information-constrained setting there is an added complication, since the output W_t can only contain b bits. If b ≪ d, then W_t cannot convey all the information on X_{t,1}, ..., X_{t,d}. Moreover, it will likely convey only little information if it doesn't already "know" j. For example, W_t may provide a little bit of information on all d random variables, but then the information conveyed on each (and in particular, on the random variable X_{t,j} which is correlated with j) will be very small. Alternatively, W_t may provide accurate information on O(b) coordinates, but since the relevant random variable X_{t,j} is not known, it is likely to 'miss' it. The proof therefore relies on the following components:
• No matter what, a (b, n, m) protocol cannot provide more than b/d bits of information (in expectation) on X_{t,j}, unless it already "knows" j.
• Even if the mutual information between W_t and X_{t,j} is only b/d, and the mutual information between X_{t,j} and j is nρ², standard information-theoretic tools such as the data processing inequality only imply that the mutual information between W_t and j is bounded by min{nρ², b/d}.
We essentially prove a stronger information contraction bound, which is the product of the two terms: O(ρ²b/d) when n = 1, and O(nρb/d) for general n. At a technical level, this is achieved by considering the relative entropy between the distributions of W_t with and without a biased coordinate j, relating it to the χ²-divergence between these distributions (using relatively recent analytic results on Csiszár f-divergences [30], [53]), and performing algebraic manipulations to upper bound it by ρ² times the mutual information between W_t and X_{t,j}, which is on average b/d as discussed earlier. This eventually leads to the mρ²b/d term in Thm. 2, and to Thm. 3 using somewhat different calculations.

Consider the standard setting of learning with expert advice, defined as a game over T rounds, where in each round t a loss vector ℓ_t ∈ [0, 1]^d is chosen, and the learner (without knowing ℓ_t) needs to pick an action i_t from a fixed set {1, ..., d}, after which the learner suffers loss ℓ_{t,i_t}. The goal of the learner is to minimize the regret in hindsight with respect to any fixed action i, namely Σ_{t=1}^T ℓ_{t,i_t} − Σ_{t=1}^T ℓ_{t,i}. We are interested in partial-information variants, where the learner doesn't get to see and use ℓ_t, but only some partial information on it. For example, in standard multi-armed bandits, the learner can only view ℓ_{t,i_t}. The following theorem is a corollary of Thm. 2, and we provide a proof in Appendix A.4.

Theorem 4. Suppose d > 1. For any (b, 1, T) protocol, there is an i.i.d. distribution over loss vectors ℓ_t ∈ [0, 1]^d such that for some numerical constant c,
$$\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t,i_t} - \min_{j}\sum_{t=1}^{T}\ell_{t,j}\right] \;\ge\; c\,\min\left\{T,\; \sqrt{\frac{d}{b}\,T}\right\}.$$

As a result, we get that for any algorithm with any partial-information feedback model (where b bits are extracted from each d-dimensional loss vector), it is impossible to get regret lower than Ω(√((d/b)T)) for sufficiently large T. Interestingly, this holds even if the algorithm is allowed to examine each loss vector ℓ_t and choose which b bits of information it wishes to retain. In contrast, full-information algorithms (e.g. Hedge [33]) can get O(√(log(d) T)) regret. Without further assumptions on the feedback model, the bound is optimal up to log-factors, as shown by O(√((d/b)T)) upper bounds for linear or coordinate measurements, where b is the number of measurements or coordinates seen [3, 42, 50] (strictly speaking, if the losses are continuous-valued these require arbitrary-precision measurements, but in any practical implementation we can assume the losses and measurements are discrete). However, the lower bound is more general and applies to any partial feedback model. For example, we immediately get an Ω(√((d/k)T)) regret lower bound when we are allowed to view k coordinates instead of 1, corresponding to (say) the semi-bandit feedback model ([21]), or the side-observation model of [42] with a fixed upper bound k on the number of side-observations. In partial monitoring ([20]), we get an Ω(√((d/k)T)) lower bound where k is the logarithm of the feedback matrix width. In learning with partially observed attributes (e.g. [22]), a simple reduction implies an Ω(√((d/k)T)) lower bound when we are constrained to view at most k features of each example.
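As a hypothetical illustration (ours, not from the paper) of why these feedback models fit the (b, 1, T) abstraction: whatever the learner retains from round t is a function of a few fixed-precision numbers extracted from ℓ_t, i.e. of at most b bits.

```python
def quantize(value, bits=16):
    # Encode a loss value in [0, 1] with `bits` bits of precision.
    levels = (1 << bits) - 1
    return int(round(min(max(float(value), 0.0), 1.0) * levels))

def bandit_feedback(loss_vec, arm, bits=16):
    # Multi-armed bandit: only the pulled arm's loss is revealed, so b = bits.
    return quantize(loss_vec[arm], bits)

def semi_bandit_feedback(loss_vec, arms, bits=16):
    # Semi-bandit / side-observation feedback on k coordinates: b = k * bits.
    # Linear measurements work the same way, with each inner product stored
    # at fixed precision.
    return [quantize(loss_vec[a], bits) for a in arms]
```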
We now turn to consider an example from stochastic optimization, where our goal is to approximately minimize F(h) = E_Z[f(h; Z)] given access to m i.i.d. instantiations of Z, whose distribution is unknown. This setting has received much attention in recent years, and can be used to model many statistical learning problems. In this section, we show a stochastic optimization problem where information-constrained protocols provably pay a performance price compared to non-constrained algorithms. We emphasize that it is going to be a very simple toy problem, and is not meant to represent anything realistic. We present it for two reasons: First, it illustrates another type of situation where information-constrained protocols may fail (in particular, problems involving matrices). Second, the intuition of the construction is also used in the more realistic problem of sparse PCA and covariance estimation, considered in the next section.

The construction is as follows: Suppose we wish to solve min_{(w,v)} F(w, v) = E_Z[f((w, v); Z)], where f((w, v); Z) = wᵀZv, Z ∈ [−1, +1]^{d×d}, and w, v range over all vectors in the simplex (i.e. w_i, v_i ≥ 0 and Σ_{i=1}^d w_i = Σ_{i=1}^d v_i = 1). A minimizer of F(w, v) is (e_{i*}, e_{j*}), where (i*, j*) are the indices of the matrix entry with minimal mean. Moreover, by a standard concentration of measure argument, given m i.i.d. instantiations Z_1, ..., Z_m from any distribution over Z, the solution (e_Ĩ, e_J̃), where (Ĩ, J̃) = arg min_{i,j} (1/m)Σ_{t=1}^m Z_{t,i,j} are the indices of the entry with the empirically smallest mean, satisfies F(e_Ĩ, e_J̃) ≤ min_{w,v} F(w, v) + O(√(log(d)/m)) with high probability. However, computing (Ĩ, J̃) as above requires us to track d² empirical means, which may be expensive when d is large.
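Here is a minimal sketch (ours, not from the paper) of the constraint-free plug-in procedure just described: it attains the O(√(log(d)/m)) guarantee but needs to hold all d² running means, in contrast to the O(d)-sized state of typical gradient-based methods.

```python
import numpy as np

def plugin_minimizer(Z_samples):
    # Constraint-free plug-in solution to min_{w,v} E[w^T Z v] over the simplex:
    # average the observed matrices (d^2 running means) and return the
    # indicator vectors of the entry with the smallest empirical mean.
    mean_Z = np.mean(np.stack(Z_samples), axis=0)
    i_star, j_star = np.unravel_index(np.argmin(mean_Z), mean_Z.shape)
    d = mean_Z.shape[0]
    w, v = np.zeros(d), np.zeros(d)
    w[i_star], v[j_star] = 1.0, 1.0   # (e_{i*}, e_{j*}): vertices of the simplex
    return w, v
```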
If we instead constrain ourselves to (b, 1, m) protocols where b = O(d) (e.g. any stochastic gradient method, whose memory is linear in the number of parameters being optimized), then one can show a lower bound of Ω(min{1, √(d/m)}) on the expected error, which is much higher than the O(√(log(d)/m)) upper bound for constraint-free protocols. This claim is a straightforward consequence of Thm. 2: We consider distributions where Z ∈ {−1, +1}^{d×d} with probability 1, each of the d² entries is chosen independently, and E[Z] is zero except at some coordinate (i*, j*) where it equals O(√(d/m)). For such distributions, getting optimization error smaller than O(√(d/m)) reduces to detecting (i*, j*), and this in turn reduces to the hide-and-seek problem defined earlier, over d² coordinates and a bias ρ = O(√(d/m)). However, Thm. 2 shows that no (b, 1, m) protocol (where b = O(d)) will succeed if mdρ² ≪ d², which indeed happens if ρ is small enough. Similar kinds of gaps can be shown using Thm. 3 for general (b, n, m) protocols, which apply to any special case such as non-interactive distributed learning.

The sparse PCA problem ([59]) is a standard and well-known statistical estimation problem, defined as follows: We are given an i.i.d. sample of vectors x ∈ R^d, and we assume that there is some direction, corresponding to some sparse vector v (of cardinality at most k), such that the variance E[(vᵀx)²] along that direction is larger than along any other direction. Our goal is to find that direction. We will focus here on the simplest possible form of this problem, where the maximizing direction v is assumed to be 2-sparse, i.e. there are only two non-zero coordinates v_i, v_j. In that case, E[(vᵀx)²] = v_i²E[x_i²] + v_j²E[x_j²] + 2v_iv_jE[x_ix_j]. Following previous work (e.g. [15]), we even assume that E[x_i²] = 1 for all i, in which case the sparse PCA problem reduces to detecting a coordinate pair (i*, j*), i* < j*, for which x_{i*}, x_{j*} are maximally correlated. A special case is a simple and natural sparse covariance estimation problem ([16, 19]), where we assume that all covariates are uncorrelated (E[x_ix_j] = 0) except for a unique correlated pair of covariates (i*, j*) which we need to detect.

This setting bears a resemblance to the example seen in the context of stochastic optimization in Section 4.2: We have a d×d stochastic matrix xxᵀ, and we need to detect an off-diagonal biased entry at location (i*, j*). Unfortunately, these stochastic matrices are rank-1, and do not have independent entries as in the example considered in Section 4.2. Instead, we use a more delicate construction, relying on distributions supported on sparse vectors. The intuition is that then each instantiation of xxᵀ is sparse, and the situation can be reduced to a variant of our hide-and-seek problem where only a few coordinates are non-zero at a time. The theorem below establishes performance gaps between constraint-free protocols (in particular, a simple plug-in estimator) and any (b, n, m) protocol for a specific choice of n, or any b-memory online protocol (see Sec. 2).

Theorem 5.
Consider the class of -sparse PCA (or covariance estimation) problems in d ≥ dimensionsas described above, and all distributions such that:1. E [ x i ] = 1 for all i .2. For a unique pair of distinct coordinates ( i ∗ , j ∗ ) , it holds that E [ x i ∗ x j ∗ ] = τ > , whereas E [ x i x j ] =0 for all distinct coordinate pairs ( i, j ) (cid:54) = ( i ∗ , j ∗ ) .3. For any i < j , if (cid:103) x i x j is the empirical average of x i x j over m i.i.d. instances, then Pr (cid:0) | (cid:103) x i x j − E [ x i x j ] | ≥ τ (cid:1) ≤ (cid:0) − mτ / (cid:1) .Then the following holds: Let ( ˜ I, ˜ J ) = arg max i This research is supported by the Intel ICRI-CI Institute, Israel Science Foundation grant 425/13, and an FP7Marie Curie CIG grant. We thank John Duchi, Yevgeny Seldin and Yuchen Zhang for helpful comments. References [1] A. Agarwal, P. Bartlett, P. Ravikumar, and M. Wainwright. Information-theoretic lower bounds onthe oracle complexity of stochastic convex optimization. Information Theory, IEEE Transactions on ,58(5):3235–3249, 2012.[2] A. Agarwal, O. Chapelle, M. Dud´ık, and J. Langford. A reliable effective terascale linear learningsystem. arXiv preprint arXiv:1110.4198 , 2011.[3] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT , 2010.[4] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequencymoments. In STOC , 1996.[5] R. Arora, A. Cotter, and N. Srebro. Stochastic optimization of pca with capped msg. In NIPS , 2013.[6] J.-Y. Audibert, S. Bubeck, and G. Lugosi. Minimax policies for combinatorial prediction games. In COLT , 2011.[7] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning , 47(2-3):235–256, 2002.[8] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing , 32(1):48–77, 2002.[9] M. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity andprivacy. In COLT , 2012. 1110] A. Balsubramani, S. Dasgupta, and Y. Freund. The fast convergence of incremental pca. In NIPS ,2013.[11] Z. Bar-Yossef, T. Jayram, R. Kumar, and D. Sivakumar. An information statistics approach to datastream and communication complexity. In FOCS , 2002.[12] B. Barak, M. Braverman, X. Chen, and A. Rao. How to compress interactive communication. In STOC , 2010.[13] G. Bart´ok, D. Foster, D. P´al, A. Rakhlin, and C. Szepesv´ari. Partial monitoring – classification, regretbounds, and algorithms. 2013.[14] R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: Parallel and distributedapproaches . Cambridge University Press, 2011.[15] A. Berthet and P. Rigollet. Complexity theoretic lower bounds for sparse principal component detec-tion. In COLT , 2013.[16] J. Bien and R. Tibshirani. Sparse estimation of a covariance matrix. Biometrika , 98(4):807–820, 2011.[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learningvia the alternating direction method of multipliers. Foundations and Trends R (cid:13) in Machine Learning ,3(1):1–122, 2011.[18] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed banditproblems. Foundations and Trends in Machine Learning , 5(1):1–122, 2012.[19] T. Cai and W. Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of theAmerican Statistical Association , 106(494), 2011.[20] N. 
Cesa-Bianchi and L. Gabor. Prediction, learning, and games . Cambridge University Press, 2006.[21] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences ,78(5):1404–1422, 2012.[22] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Efficient learning with partially observed at-tributes. The Journal of Machine Learning Research , 12:2857–2878, 2011.[23] A. Chakrabarti, S. Khot, and X. Sun. Near-optimal lower bounds on the multi-party communicationcomplexity of set disjointness. In CCC , 2003.[24] S. Chien, K. Ligett, and A. McGregor. Space-efficient estimation of robust statistics and distributiontesting. In ICS , 2010.[25] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gra-dient methods. In NIPS , 2011.[26] T. Cover and J. Thomas. Elements of information theory . John Wiley & Sons, 2006.[27] M. Crouch, A. McGregor, and D. Woodruff. Stochastic streams: Sample complexity vs. space com-plexity. In MASSIVE , 2013. 1228] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction. In ICML ,2011.[29] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction usingmini-batches. The Journal of Machine Learning Research , 13:165–202, 2012.[30] S. S. Dragomir. Upper and lower bounds for Csisz´ar’s f-divergence in terms of the Kullback-Leiblerdistance and applications. In Inequalities for Csisz´ar f-Divergence in Information Theory . RGMIAMonographs, 2000.[31] J. Duchi, A. Agarwal, and M. Wainwright. Dual averaging for distributed optimization: convergenceanalysis and network scaling. Automatic Control, IEEE Transactions on , 57(3):592–606, 2012.[32] E. Ertin and L. Potter. Sequential detection with limited memory. In Statistical Signal Processing,2003 IEEE Workshop on , pages 585–588, 2003.[33] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an applicationto boosting. Journal of Computer and System Sciences , 55(1):119–139, 1997.[34] S. Guha and A. McGregor. Space-efficient sampling. In AISTATS , 2007.[35] A. Gy¨orgy, T. Linder, G. Lugosi, and G. Ottucs´ak. The on-line shortest path problem under partialmonitoring. Journal of Machine Learning Research , 8:2369–2403, 2007.[36] M. Hellman and T. Cover. Learning with finite memory. Annals of Mathematical Statistics , pages765–782, 1970.[37] L. Kontorovich. Statistical estimation with bounded memory. Statistics and Computing , 22(5):1155–1164, 2012.[38] A. Kyrola, D. Bickson, C. Guestrin, and J. Bradley. Parallel coordinate descent for l1-regularized lossminimization. In ICML) , 2011.[39] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information.In NIPS , 2007.[40] F. Leighton and R. Rivest. Estimating a probability using finite memory. Information Theory, IEEETransactions on , 32(6):733–742, 1986.[41] L. Li, W. Chu, J. Langford, and R. Schapire. A contextual-bandit approach to personalized news articlerecommendation. In WWW , 2010.[42] S. Mannor and O. Shamir. From bandits to experts: On the value of side-observations. In NIPS , 2011.[43] I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming pca. In NIPS , 2013.[44] S. Muthukrishnan. Data streams: Algorithms and applications . Now Publishers Inc, 2005.[45] A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization . Wiley-Interscience, 1983. 1346] Y. Nesterov. Smooth minimization of non-smooth functions. 
Mathematical Programming ,103(1):127–152, 2005.[47] F. Niu, B. Recht, C. R´e, and S. Wright. Hogwild: A lock-free approach to parallelizing stochasticgradient descent. In NIPS , 2011.[48] M. Raginsky and A. Rakhlin. Information-based complexity, feedback and dynamics in convex pro-gramming. Information Theory, IEEE Transactions on , 57(10):7036–7056, 2011.[49] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS , 2007.[50] Y. Seldin, P. Bartlett, K. Crammer, and Y. Abbasi-Yadkori. Prediction with limited advice and multi-armed bandits with paid observations. In ICML , 2014.[51] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Ma-chine Learning , 4(2):107–194, 2011.[52] S. Sra, S. Nowozin, and S. Wright. Optimization for Machine Learning . Mit Press, 2011.[53] I. Taneja and P. Kumar. Relative information of type s, Csisz´ar’s f-divergence, and information in-equalities. Inf. Sci. , 166(1-4):105–125, 2004.[54] C. Williams and M. Seeger. Using the nystr¨om method to speed up kernel machines. In NIPS , 2001.[55] D. Woodruff. The average-case complexity of counting distinct elements. In ICDT , 2009.[56] Y. Zhang, J. Duchi, M. Jordan, and M. Wainwright. Information-theoretic lower bounds for distributedstatistical estimation with communication constraints. In NIPS , 2013.[57] Y. Zhang, J. Duchi, and M. Wainwright. Communication-efficient algorithms for statistical optimiza-tion. In NIPS , 2012.[58] Y. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel ridge regression. In COLT , 2013.[59] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of computationaland graphical statistics , 15(2):265–286, 2006. A Proofs The proofs use several standard quantities and results from information theory – see Appendix B for moredetails. They also make use of a several auxiliary lemmas (presented in Subsection A.1), including a simplebut key lemma (Lemma 6) which quantifies how information-constrained protocols cannot provide informa-tion on all coordinates simultaneously. A.1 Auxiliary Lemmas Lemma 1. Suppose that d > , and for some fixed distribution Pr ( · ) over the messages w , . . . , w m computed by an information-constrained protocol, it holds that (cid:118)(cid:117)(cid:117)(cid:116) d d (cid:88) j =1 D kl (Pr ( w . . . w m ) || Pr j ( w . . . w m )) ≤ B. hen there exist some j such that Pr( ˜ J = j ) ≤ d + 2 B. Proof. By concavity of the square root, we have (cid:118)(cid:117)(cid:117)(cid:116) d d (cid:88) j =1 D kl (Pr ( w . . . w m ) || Pr j ( w . . . w m )) ≥ d d (cid:88) j =1 (cid:113) D kl (Pr ( w . . . w m ) || Pr j ( w . . . w m )) . Using Pinsker’s inequality and the fact that ˜ J is some function of the messages w , . . . , w m (independent ofthe data distribution), this is at least d d (cid:88) j =1 (cid:88) w ...w m (cid:12)(cid:12) Pr ( w . . . w m ) − Pr j ( w . . . w m ) (cid:12)(cid:12) ≥ d d (cid:88) j =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) w ...w m (cid:0) Pr ( w . . . w m ) − Pr j ( w . . . w m ) (cid:1) Pr (cid:16) ˜ J | w . . . w m (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ d d (cid:88) j =1 | Pr ( ˜ J = j ) − Pr j ( ˜ J = j ) | . Thus, we may assume that d d (cid:88) j =1 | Pr ( ˜ J = j ) − Pr j ( ˜ J = j ) | ≤ B. The argument now uses a basic variant of the probabilistic method. If the expression above is at most B ,then for at least d/ values of j , it holds that | Pr ( ˜ J = j ) − Pr j ( ˜ J = j ) | ≤ B . 
Also, since (cid:80) dj =1 Pr ( ˜ J = j ) = 1 , then for at least d/ values of j , it holds that Pr ( ˜ J = j ) ≤ /d . Combining the two observations,and assuming that d > , it means there must exist some value of j such that | Pr ( ˜ J ) − Pr j ( ˜ J = j ) | ≤ B , as well as Pr ( ˜ J = j ) ≤ /d , hence Pr j ( ˜ J = j ) ≤ d + 2 B as required. Lemma 2. Let p, q be distributions over a product domain A × A × . . . × A d , where each A i is a finiteset. Suppose that for some j ∈ { , . . . , d } , the following inequality holds for all z = ( z , . . . , z d ) ∈ A × . . . × A d : p ( { z i } i (cid:54) = j | z j ) = q ( { z i } i (cid:54) = j | z j ) . Also, let E be an event such that p ( E | z ) = q ( E | z ) for all z . Then p ( E ) = (cid:88) z j p ( z j ) q ( E | z j ) . roof. p ( E ) = (cid:88) z p ( z ) p ( E | z ) = (cid:88) z p ( z ) q ( E | z )= (cid:88) z j p ( z j ) (cid:88) { z i } i (cid:54) = j p ( { z j } i (cid:54) = j | z j ) q ( E | z j , { z i } i (cid:54) = j )= (cid:88) z j p ( z j ) (cid:88) { z i } i (cid:54) = j q ( { z j } i (cid:54) = j | z j ) q ( E | z j , { z i } i (cid:54) = j )= (cid:88) z j p ( z j ) q ( E | z j ) . Lemma 3 ([30], Proposition 1) . Let p, q be two distributions on a discrete set, such that max x p ( x ) q ( x ) ≤ c .Then D kl ( p ( · ) || q ( · )) ≤ c D kl ( q ( · ) || p ( · )) . Lemma 4 ([30], Proposition 2 and Remark 4) . Let p, q be two distributions on a discrete set, such that max x p ( x ) q ( x ) ≤ c . Also, let D χ ( p ( · ) || q ( · )) = (cid:80) x ( p ( x ) − q ( x )) q ( x ) denote the χ -divergence between the distribu-tions p, q . Then D kl ( p ( · ) || q ( · )) ≤ D χ ( p ( · ) || q ( · )) ≤ c D kl ( p ( · ) || q ( · )) . Lemma 5. Suppose we throw n balls independently and uniformly at random into d > bins, andlet K , . . . K d denote the number of balls in each of the d bins. Then for any (cid:15) ≥ such that (cid:15) ≤ min { , 12 log( d ) , d n } , it holds that E (cid:20) exp (cid:18) (cid:15) max j K j (cid:19)(cid:21) < . Proof. Each K j can be written as (cid:80) ni =1 ( ball i fell into bin j ) , and has expectation n/d . Therefore, by astandard multiplicative Chernoff bound, for any γ ≥ , Pr (cid:16) K j > (1 + γ ) nd (cid:17) ≤ exp (cid:18) − γ γ ) nd (cid:19) . By a union bound, this implies that Pr (cid:18) max j K j > (1 + γ ) nd (cid:19) ≤ d (cid:88) j =1 Pr (cid:16) K j > (1 + γ ) nd (cid:17) ≤ d exp (cid:18) − γ γ ) nd (cid:19) . In particular, if γ + 1 ≥ , we can upper bound the above by the simpler expression exp( − (1 + γ ) n/ d ) .Letting τ = γ + 1 , we get that for any τ ≥ , Pr (cid:18) max j K j > τ nd (cid:19) ≤ d exp (cid:16) − τ n d (cid:17) . (1)16efine c = max { , d (cid:15) } . Using the inequality above and the non-negativity of exp( (cid:15) max j K j ) , we have E (cid:20) exp( (cid:15) max j K j ) (cid:21) = (cid:90) ∞ t =0 Pr (cid:18) exp( (cid:15) max j K j ) ≥ t (cid:19) dt ≤ c + (cid:90) ∞ t = c Pr (cid:18) exp( (cid:15) max j K j ) ≥ t (cid:19) dt = c + (cid:90) ∞ t = c Pr (cid:18) max j K j ≥ log( t ) (cid:15) (cid:19) dt = c + (cid:90) ∞ t = c Pr (cid:18) max j K j ≥ log( t ) d(cid:15)n nd (cid:19) dt Since we assume (cid:15) ≤ d/ n and c ≥ , it holds that exp(6 (cid:15)n/d ) ≤ exp(2) < ≤ c , which implies log( c ) d/(cid:15)n ≥ . Therefore, for any t ≥ c , it holds that log( t ) d/(cid:15)n ≥ . This allows us to use Eq. (1) toupper bound the expression above by c + d (cid:90) ∞ t = c exp (cid:18) − log( t ) d (cid:15)n nd (cid:19) dt = c + d (cid:90) ∞ t = c t − / (cid:15) dt. 
Since we assume (cid:15) ≤ / , we have / (3 (cid:15) ) ≥ , and therefore we can solve the integration to get c + d (cid:15) − c − (cid:15) ≤ c + dc − (cid:15) . Using the value of c , and since − (cid:15) ≤ − , this is at most max { , d (cid:15) } + d ∗ (cid:0) d (cid:15) (cid:1) − (cid:15) = max { , d (cid:15) } + d (cid:15) . Since (cid:15) ≤ / d ) , this is at most max { , exp(3 / } + exp(3 / < as required. Lemma 6. Let Z , . . . , Z d be independent random variables, and let W be a random variable which cantake at most b values. Then d d (cid:88) j =1 I ( W ; Z j ) ≤ bd . Proof. We have d d (cid:88) j =1 I ( W ; Z j ) = 1 d d (cid:88) j =1 ( H ( Z j ) − H ( Z j | W )) . (cid:80) dj =1 H ( Z j | W ) ≥ H ( Z . . . , Z d | W ) , this is at most d d (cid:88) j =1 H ( Z j ) − d H ( Z . . . Z d | W )= 1 d d (cid:88) j =1 H ( Z j ) − d ( H ( Z . . . Z d ) − I ( Z . . . Z d ; W ))= 1 d I ( Z . . . Z d ; W ) + 1 d d (cid:88) j =1 H ( Z j ) − H ( Z . . . Z d ) . (2)Since Z . . . Z d are independent, (cid:80) dj =1 H ( Z j ) = H ( Z . . . Z d ) , hence the above equals d I ( Z . . . Z d ; W ) = 1 d ( H ( W ) − H ( W | Z . . . Z d )) ≤ d H ( W ) , which is at most b/d since W is only allowed to have b values. A.2 Proof of Thm. 2 We will actually prove a more general result, stating that for any ( b, n, m ) protocol, Pr j ( ˜ J = j ) ≤ d + 14 . (cid:114) mn n ρ bd . The result stated in the theorem follows in the case n = 1 .The proof builds on the auxiliary lemmas presented in Appendix A.1.On top of the distributions Pr j ( · ) defined in the hide-and-seek problem (Definition 2), we define anadditional ‘reference’ distribution Pr ( · ) , which corresponds to the instances x chosen uniformly at randomfrom {− , +1 } d (i.e. there is no biased coordinate).Let w , . . . , w m denote the messages computed by the protocol. It is enough to prove that d d (cid:88) j =1 D kl (cid:0) Pr ( w . . . w m ) (cid:12)(cid:12)(cid:12)(cid:12) Pr j ( w . . . w m ) (cid:1) ≤ mn n ρ b/d, (3)since then by applying Lemma 1, we get that for some j , Pr j ( ˜ J = j ) ≤ (3 /d ) + 2 (cid:112) mn n ρ b/d ≤ (3 /d ) + 14 . (cid:112) mn n ρ b/d as required.Using the chain rule, the left hand side in Eq. (3) equals d d (cid:88) j =1 m (cid:88) t =1 E w ...w t − ∼ Pr (cid:2) D kl (cid:0) Pr ( w t | w . . . w t − ) || Pr j ( w t | w . . . w t − ) (cid:1)(cid:3) = 2 m (cid:88) t =1 E w ...w t − ∼ Pr d d (cid:88) j =1 D kl (cid:0) Pr ( w t | w . . . w t − ) || Pr j ( w t | w . . . w t − ) (cid:1) (4)18et us focus on a particular choice of t and values w . . . w t − . To simplify the presentation, we drop the t superscript from the message w t , and denote the previous messages w . . . w t − as ˆ w . Thus, we considerthe quantity d d (cid:88) j =1 D kl (Pr ( w | ˆ w ) || Pr j ( w | ˆ w )) . (5)Recall that w is some function of ˆ w and a set of n independent instances received in the current round. Let x j denote the vector of values at coordinate j across these n instances. Clearly, under Pr j , every x i for i (cid:54) = j is uniformly distributed on {− , +1 } n , whereas each entry of x j equals with probability + ρ , and − otherwise.First, we argue that by Lemma 2, for any w, ˆ w , we have Pr j ( w | ˆ w ) = Pr ( w | ˆ w ) (cid:88) x j Pr j ( x j | ˆ w ) = (cid:88) x j Pr ( w | ˆ w )Pr j ( x j | ˆ w ) = (cid:88) x j Pr ( w | ˆ w )Pr j ( x j ) . (6)This follows by applying the lemma on p ( · ) = Pr j ( ·| ˆ w ) , q ( · ) = Pr ( ·| ˆ w ) and A i = {− , +1 } n (i.e. 
thevector of values at a single coordinate i across the n data points), and noting the x j is independent of ˆ w . Thelemma’s conditions are satisfied since x i for i (cid:54) = j has the same distribution under Pr ( ·| ˆ w ) and Pr j ( ·| ˆ w ) ,and also w is only a function of x . . . x d and ˆ w .Using Lemma 3 and Lemma 4, we have the following. D kl (Pr ( w | ˆ w ) || Pr j ( w | ˆ w )) ≤ max w (cid:18) Pr ( w | ˆ w )Pr j ( w | ˆ w ) (cid:19) D kl (Pr j ( w | ˆ w ) || Pr ( w | ˆ w )) ≤ max w (cid:18) Pr ( w | ˆ w )Pr j ( w | ˆ w ) (cid:19) D χ (Pr j ( w | ˆ w ) || Pr ( w | ˆ w ))= max w (cid:18) Pr ( w | ˆ w )Pr j ( w | ˆ w ) (cid:19) (cid:88) w (Pr j ( w | ˆ w ) − Pr ( w | ˆ w )) Pr ( w | ˆ w ) (7)Let us consider the max term and the sum seperately. Using Eq. (6) and the fact that ρ ≤ / n , we have max w (cid:18) Pr ( w | ˆ w )Pr j ( w | ˆ w ) (cid:19) = max w (cid:32) (cid:80) x j Pr ( w | x j , ˆ w )Pr ( x j ) (cid:80) x j Pr ( w | x j , ˆ w )Pr j ( x j ) (cid:33) ≤ max x j (cid:18) Pr ( x j )Pr j ( x j ) (cid:19) = (cid:18) / / − ρ (cid:19) n ≤ (1 + 4 ρ ) n ≤ (1 + 1 /n ) n ≤ exp(1) . (8)19s to the sum term in Eq. (7), using Eq. (6) and the Cauchy-Schwartz inequality, we have (cid:88) w (Pr j ( w | ˆ w ) − Pr ( w | ˆ w )) Pr ( w | ˆ w ) = (cid:88) w (cid:16)(cid:80) x j Pr ( w | x j , ˆ w ) (Pr j ( x j ) − Pr ( x j )) (cid:17) Pr ( w | ˆ w )= (cid:88) w (cid:16)(cid:80) x j (Pr ( w | x j , ˆ w ) − Pr ( w | ˆ w )) (Pr j ( x j ) − Pr ( x j )) (cid:17) Pr ( w | ˆ w ) ≤ (cid:88) w (cid:80) x j (Pr ( w | x j , ˆ w ) − Pr ( w | ˆ w )) (cid:80) x j (Pr j ( x j ) − Pr ( x j )) Pr ( w | ˆ w )= (cid:88) x j (Pr j ( x j ) − Pr ( x j )) (cid:88) x j D χ (Pr ( w | x j , ˆ w ) || Pr ( w | ˆ w )) . (9)where we used the definition of χ -divergence as specified in Lemma 4. Again, we will consider each sumseparately. Applying Lemma 4 and Eq. (6), we have D χ (Pr ( w | x j , ˆ w ) || Pr ( w | ˆ w )) ≤ w (cid:18) Pr ( w | x j , ˆ w )Pr ( w | ˆ w ) (cid:19) D kl (Pr ( w | x j , ˆ w ) || Pr ( w | ˆ w ))= 2 max w (cid:32) Pr ( w | x j , ˆ w ) (cid:80) x j Pr ( w | x j , ˆ w )Pr ( x j ) (cid:33) D kl (Pr ( w | x j , ˆ w ) || Pr ( w | ˆ w ))= 2 max w (cid:32) Pr ( w | x j , ˆ w ) n (cid:80) x j Pr ( w | x j , ˆ w ) (cid:33) D kl (Pr ( w | x j , ˆ w ) || Pr ( w | ˆ w )) ≤ n +1 D kl (Pr ( w | x j , ˆ w ) || Pr ( w | ˆ w )) (10)Moreover, by definition of Pr and Pr j , and using the fact that each coordinate of x j takes values in {− , +1 } , we have (cid:88) x j (Pr j ( x j ) − Pr ( x j )) = (cid:88) x j (cid:32) n (cid:89) i =1 (cid:18) 12 + ρx j,i (cid:19) − n (cid:33) = 14 n (cid:88) x j (cid:32) n (cid:89) i =1 (1 + 2 ρx j,i ) − (cid:33) = 14 n (cid:88) x j (cid:32) n (cid:89) i =1 (1 + 2 ρx j,i ) − n (cid:89) i =1 (1 + 2 ρx j,i ) + 1 (cid:33) = 14 n n (cid:89) i =1 (cid:88) x j,i (1 + 2 ρx j,i ) − n (cid:89) i =1 (cid:88) x j,i (1 + 2 ρx j,i ) + 2 n = 14 n (cid:0) (2 + 8 ρ ) n − n +1 + 2 n (cid:1) = 12 n (cid:0) (1 + 4 ρ ) n − (cid:1) = 12 n (cid:18)(cid:18) nρ n (cid:19) n − (cid:19) ≤ n (cid:0) exp(4 nρ ) − (cid:1) ≤ . n nρ , (11)where in the last inequality we used the fact that nρ ≤ n (1 / n ) ≤ . , and exp( x ) ≤ . x forany x ∈ [0 , . . Plugging in Eq. (10) and Eq. (11) back into Eq. (9), we get that (cid:88) w (Pr j ( w | ˆ w ) − Pr ( w | ˆ w )) Pr ( w | ˆ w ) ≤ . nρ (cid:88) x j D kl (Pr ( w | x j , ˆ w ) || Pr ( w | ˆ w )) . D kl (Pr ( w | ˆ w ) || Pr j ( w | ˆ w )) ≤ . nρ (cid:88) x j D kl (Pr ( w | x j , ˆ w ) || Pr ( w | ˆ w )) . This expression can be equivalently written as . 
n n ρ (cid:88) x j n D kl (Pr ( w | x j , ˆ w ) || Pr ( w | ˆ w ))= 9 . n n ρ (cid:88) x j Pr ( x j | ˆ w ) D kl (Pr ( w | x j , ˆ w ) || Pr ( w | ˆ w ))= 9 . n n ρ I Pr ( ·| ˆ w ) ( w ; x j ) where I Pr ( ·| ˆ w ) ( w ; x j ) denotes the mutual information between w and x j , under the (uniform) distributionon x j induced by Pr ( ·| ˆ w ) . This allows us to upper bound Eq. (5) as follows: d d (cid:88) j =1 D kl (Pr ( w | ˆ w ) || Pr j ( w | ˆ w )) ≤ . n n ρ d d (cid:88) j =1 I Pr ( ·| ˆ w ) ( w ; x j ) . Since x , . . . , x d are independent of each other and w contains at most b bits, we can use the key Lemma 6to upper bound the above by . n n ρ b/d .To summarize, this expression constitutes an upper bound on Eq. (5), i.e. on any individual term insidethe expectation in Eq. (4). Thus, we can upper bound Eq. (4) by . mn n ρ b/d < mn n ρ b/d .This shows that Eq. (3) indeed holds, which as explained earlier implies the required result. A.3 Proof of Thm. 3 The proof builds on the auxiliary lemmas presented in Appendix A.1. It begins similarly to the proof ofThm. 2, but soon diverges.On top of the distributions Pr j ( · ) defined in the hide-and-seek problem (Definition 2), we define anadditional ‘reference’ distribution Pr ( · ) , which corresponds to the instances x chosen uniformly at randomfrom {− , +1 } d (i.e. there is no biased coordinate).Let w , . . . , w m denote the messages computed by the protocol. To show the upper bound, it is enoughto prove that d d (cid:88) j =1 D kl (cid:0) Pr ( w . . . w m ) (cid:12)(cid:12)(cid:12)(cid:12) Pr j ( w . . . w m ) (cid:1) ≤ min (cid:26) mnρbd , mnρ (cid:27) (12)since then by applying Lemma 1, we get that for some j , Pr j ( ˜ J = j ) ≤ (3 /d )+2 (cid:112) min { mnρb/d, mnρ } ≤ (3 /d ) + 5 (cid:112) mn min { ρb/d, ρ } as required.Using the chain rule, the left hand side in Eq. (12) equals d d (cid:88) j =1 m (cid:88) t =1 E w ...w t − ∼ Pr (cid:2) D kl (cid:0) Pr ( w t | w . . . w t − ) || Pr j ( w t | w . . . w t − ) (cid:1)(cid:3) = 2 m (cid:88) t =1 E w ...w t − ∼ Pr d d (cid:88) j =1 D kl (cid:0) Pr ( w t | w . . . w t − ) || Pr j ( w t | w . . . w t − ) (cid:1) (13)21et us focus on a particular choice of t and values w . . . w t − . To simplify the presentation, we drop the t superscript from the message w t , and denote the previous messages w . . . w t − as ˆ w . Thus, we considerthe quantity d d (cid:88) j =1 D kl (Pr ( w | ˆ w ) || Pr j ( w | ˆ w )) . (14)Recall that w is some function of ˆ w and a set of n independent instances received in the current round. Let x j denote the vector of values at coordinate j across these n instances. Clearly, under Pr j , every x i for i (cid:54) = j is uniformly distributed on {− , +1 } n , whereas each entry of x j equals with probability + ρ , and − otherwise.We now show that Eq. (14) can be upper bounded in two different ways, one bound being nρb/d andthe other being nρ . Combining the two, we get that d d (cid:88) j =1 D kl (Pr ( w | ˆ w ) || Pr j ( w | ˆ w )) ≤ min (cid:26) nρbd , nρ (cid:27) . (15)Plugging this inequality back in Eq. (13), we validate Eq. (12), from which the result follows. The nρ bound This bound essentially follows only from the fact that x j is noisy, and not from the algorithm’s informationconstraints, and is thus easier to obtain.First, we have by Lemma 2 that for any w, ˆ w , Pr j ( w | ˆ w ) = (cid:88) x j Pr ( w | ˆ w )Pr j ( x j | ˆ w ) = (cid:88) x j Pr ( w | ˆ w )Pr j ( x j ) (this is the same as Eq. 
(6), and the justification is the same).Using this inequality, the definition of relative entropy, and the log-sum inequality, we have d d (cid:88) j =1 D kl (Pr ( w | ˆ w ) || Pr j ( w | ˆ w )) = 1 d d (cid:88) j =1 (cid:88) w Pr ( w | ˆ w ) log (cid:18) Pr ( w | ˆ w )Pr j ( w | ˆ w ) (cid:19) = 1 d d (cid:88) j =1 (cid:88) w Pr ( w | ˆ w ) (cid:88) x j Pr ( x j ) log (cid:32) (cid:80) x j Pr ( w | x j , ˆ w )Pr ( x j ) (cid:80) x j Pr ( w | x j , ˆ w )Pr j ( x j ) (cid:33) ≤ d d (cid:88) j =1 (cid:88) w Pr ( w | ˆ w ) (cid:88) x j Pr ( x j ) log (cid:18) Pr ( w | x j , ˆ w )Pr ( x j )Pr ( w | x j , ˆ w )Pr j ( x j ) (cid:19) = 1 d d (cid:88) j =1 (cid:88) w Pr ( w | ˆ w ) (cid:88) x j Pr ( x j ) log (cid:18) Pr ( x j )Pr j ( x j ) (cid:19) = 1 d d (cid:88) j =1 (cid:88) x j Pr ( x j ) log (cid:18) Pr ( x j )Pr j ( x j ) (cid:19) = 1 d d (cid:88) j =1 D kl (Pr ( x j ) || Pr j ( x j )) . n independent Bernoulli trials with parameter / , and n independent Bernoulli trials with parameter / ρ . This is easily verified to equal n times the relativeentropy for a single trial, which equals (by definition of relative entropy) 12 log (cid:18) / / − ρ (cid:19) + 12 log (cid:18) / / ρ (cid:19) = − 12 log (cid:0) − ρ (cid:1) ≤ / ρ , where we used the fact that ρ ≤ / n ≤ / , and the inequality − log(1 − x ) ≤ / x for x ∈ [0 , / .Overall, we get that d d (cid:88) j =1 D kl (Pr ( w | ˆ w ) || Pr j ( w | ˆ w )) ≤ / nρ ≤ nρ . The nρb/d bound To prove this bound, it will be convenient for us to describe the sampling process of x j in a slightly morecomplex way, as follows : • We let v ∈ { , } n be an auxiliary random vector with independent entries, where each v i = 1 withprobability ρ , and otherwise. • Under Pr and Pr i for i (cid:54) = j , we assume that x j is drawn uniformly from {− , +1 } n regardless ofthe value of v . • Under Pr j , we assume that each entry x j,l is independently sampled (in a manner depending on v ) asfollows: – For each l such that v l = 1 , we pick x j,l to be with probability / , and − otherwise. – For each l such that v l = 0 , we pick x j,l to be or − with probability / .Note that this induces the same distribution on x j as before: Each individual entry x j,l is independent andsatisfies Pr j ( x j,l = 1) = 4 ρ ∗ + (1 − ρ ) ∗ = + ρ .Having finished with these definitions, we re-write Eq. (14) as d d (cid:88) j =1 D kl ( E v [Pr ( w | v , ˆ w )] || E v [Pr j ( w | v , ˆ w )]) . Since the relative entropy is jointly convex in its arguments, and v is a fixed random variable, we have byJensen’s inequality that this is at most E v d d (cid:88) j =1 D kl (Pr ( w | v , ˆ w ) || Pr j ( w | v , ˆ w )) . Now, note that if v = (i.e. the zero-vector), then the distribution of x , . . . , x d is the same under both Pr and any Pr j . Since w is a function of x , . . . , x d , it follows that the distribution of w will be the same under We suspect that this construction can be simplified, but were unable to achieve this without considerably weakening the bound. Pr j and Pr , and therefore the relative entropy terms will be zero. Hence, we can trivially re-write theabove as E v v (cid:54) = d d (cid:88) j =1 D kl (Pr ( w | v , ˆ w ) || Pr j ( w | v , ˆ w )) . (16)where v (cid:54) = is an indicator function.We can now use Lemma 2, where p ( · ) = Pr j ( ·| v , ˆ w ) , q ( · ) = Pr ( ·| v , ˆ w ) and A i = {− , +1 } n (i.e. thevector of values at a single coordinate i across the n data points). 
The $30n\rho b/d$ bound

To prove this bound, it will be convenient for us to describe the sampling process of $x_j$ in a slightly more complex way, as follows (we suspect that this construction can be simplified, but we were unable to achieve this without considerably weakening the bound):

• We let $v\in\{0,1\}^n$ be an auxiliary random vector with independent entries, where each $v_i = 1$ with probability $4\rho$, and $0$ otherwise.
• Under $\Pr^0$ and $\Pr_i$ for $i\neq j$, we assume that $x_j$ is drawn uniformly from $\{-1,+1\}^n$, regardless of the value of $v$.
• Under $\Pr_j$, we assume that each entry $x_{j,l}$ is independently sampled (in a manner depending on $v$) as follows:
  – For each $l$ such that $v_l = 1$, we pick $x_{j,l}$ to be $1$ with probability $\frac{3}{4}$, and $-1$ otherwise.
  – For each $l$ such that $v_l = 0$, we pick $x_{j,l}$ to be $1$ or $-1$ with probability $\frac{1}{2}$ each.

Note that this induces the same distribution on $x_j$ as before: each individual entry $x_{j,l}$ is independent and satisfies $\Pr_j(x_{j,l}=1) = 4\rho\cdot\frac{3}{4} + (1-4\rho)\cdot\frac{1}{2} = \frac{1}{2}+\rho$.

Having finished with these definitions, we re-write Eq. (14) as
$$\frac{1}{d}\sum_{j=1}^d D_{kl}\big(\mathbb{E}_v[\Pr^0(w\mid v,\hat w)]\,\|\,\mathbb{E}_v[\Pr_j(w\mid v,\hat w)]\big).$$
Since the relative entropy is jointly convex in its arguments, and $v$ is a fixed random variable, we have by Jensen's inequality that this is at most
$$\mathbb{E}_v\left[\frac{1}{d}\sum_{j=1}^d D_{kl}\big(\Pr^0(w\mid v,\hat w)\,\|\,\Pr_j(w\mid v,\hat w)\big)\right].$$
Now, note that if $v=0$ (i.e. the zero vector), then the distribution of $x_1,\ldots,x_d$ is the same under both $\Pr^0$ and any $\Pr_j$. Since $w$ is a function of $x_1,\ldots,x_d$, it follows that the distribution of $w$ will be the same under $\Pr_j$ and $\Pr^0$, and therefore the relative entropy terms will be zero. Hence, we can trivially re-write the above as
$$\mathbb{E}_v\left[\mathbf{1}_{v\neq 0}\,\frac{1}{d}\sum_{j=1}^d D_{kl}\big(\Pr^0(w\mid v,\hat w)\,\|\,\Pr_j(w\mid v,\hat w)\big)\right], \qquad (16)$$
where $\mathbf{1}_{v\neq 0}$ is an indicator function.

We can now use Lemma 2, where $p(\cdot)=\Pr_j(\cdot\mid v,\hat w)$, $q(\cdot)=\Pr^0(\cdot\mid v,\hat w)$ and $A_i=\{-1,+1\}^n$ (i.e. the vector of values at a single coordinate $i$ across the $n$ data points). The lemma's conditions are satisfied since $x_i$ for $i\neq j$ has the same distribution under $\Pr^0(\cdot\mid v,\hat w)$ and $\Pr_j(\cdot\mid v,\hat w)$, and also $w$ is only a function of $x_1\ldots x_d$ and $\hat w$. Thus, we can rewrite Eq. (16) as
$$\mathbb{E}_v\left[\mathbf{1}_{v\neq 0}\,\frac{1}{d}\sum_{j=1}^d D_{kl}\Big(\Pr^0(w\mid v,\hat w)\,\Big\|\,\sum_{x_j}\Pr^0(w\mid x_j,v,\hat w)\,\Pr_j(x_j\mid v,\hat w)\Big)\right].$$
Using Lemma 3, we can reverse the expressions in the relative entropy term, and upper bound the above by
$$\mathbb{E}_v\left[\mathbf{1}_{v\neq 0}\,\frac{1}{d}\sum_{j=1}^d\left(\max_w\frac{\Pr^0(w\mid v,\hat w)}{\sum_{x_j}\Pr^0(w\mid x_j,v,\hat w)\,\Pr_j(x_j\mid v,\hat w)}\right) D_{kl}\Big(\sum_{x_j}\Pr^0(w\mid x_j,v,\hat w)\,\Pr_j(x_j\mid v,\hat w)\,\Big\|\,\Pr^0(w\mid v,\hat w)\Big)\right]. \qquad (17)$$
The max term equals
$$\max_w\frac{\sum_{x_j}\Pr^0(w\mid x_j,v,\hat w)\,\Pr^0(x_j\mid v,\hat w)}{\sum_{x_j}\Pr^0(w\mid x_j,v,\hat w)\,\Pr_j(x_j\mid v,\hat w)} \;\le\; \max_{x_j}\frac{\Pr^0(x_j\mid v,\hat w)}{\Pr_j(x_j\mid v,\hat w)},$$
and using Jensen's inequality and the fact that relative entropy is convex in its arguments, we can upper bound the relative entropy term by
$$\sum_{x_j}\Pr_j(x_j\mid v,\hat w)\, D_{kl}\big(\Pr^0(w\mid x_j,v,\hat w)\,\|\,\Pr^0(w\mid v,\hat w)\big) \;\le\; \left(\max_{x_j}\frac{\Pr_j(x_j\mid v,\hat w)}{\Pr^0(x_j\mid v,\hat w)}\right)\sum_{x_j}\Pr^0(x_j\mid v,\hat w)\, D_{kl}\big(\Pr^0(w\mid x_j,v,\hat w)\,\|\,\Pr^0(w\mid v,\hat w)\big).$$
The sum in the expression above equals the mutual information between the message $w$ and the coordinate vector $x_j$ (seen as random variables with respect to the distribution $\Pr^0(\cdot\mid v,\hat w)$). Writing this as $I_{\Pr^0(\cdot\mid v,\hat w)}(w;x_j)$, we can thus upper bound Eq. (17) by
$$\mathbb{E}_v\left[\mathbf{1}_{v\neq 0}\,\frac{1}{d}\sum_{j=1}^d\left(\max_{x_j}\frac{\Pr^0(x_j\mid v,\hat w)}{\Pr_j(x_j\mid v,\hat w)}\right)\left(\max_{x_j}\frac{\Pr_j(x_j\mid v,\hat w)}{\Pr^0(x_j\mid v,\hat w)}\right) I_{\Pr^0(\cdot\mid v,\hat w)}(w;x_j)\right]$$
$$\le\; \mathbb{E}_v\left[\mathbf{1}_{v\neq 0}\left(\max_{j,x_j}\frac{\Pr^0(x_j\mid v,\hat w)}{\Pr_j(x_j\mid v,\hat w)}\right)\left(\max_{j,x_j}\frac{\Pr_j(x_j\mid v,\hat w)}{\Pr^0(x_j\mid v,\hat w)}\right)\frac{1}{d}\sum_{j=1}^d I_{\Pr^0(\cdot\mid v,\hat w)}(w;x_j)\right].$$
Since $\{x_j\}_j$ are independent of each other and $w$ contains at most $b$ bits, we can use the key Lemma 6 to upper bound the above by
$$\mathbb{E}_v\left[\mathbf{1}_{v\neq 0}\left(\max_{j,x_j}\frac{\Pr^0(x_j\mid v,\hat w)}{\Pr_j(x_j\mid v,\hat w)}\right)\left(\max_{j,x_j}\frac{\Pr_j(x_j\mid v,\hat w)}{\Pr^0(x_j\mid v,\hat w)}\right)\frac{b}{d}\right].$$
For any $j$, $x_j$ refers to a column of $n$ independent entries, drawn independently of any previous messages $\hat w$, where under $\Pr^0$, each entry $x_{j,i}$ is chosen to be $\pm 1$ with equal probability, whereas under $\Pr_j$ each is chosen to be $1$ with probability $\frac{3}{4}$ if $v_i=1$, and with probability $\frac{1}{2}$ if $v_i=0$. Therefore, letting $|v|$ denote the number of non-zero entries in $v$, we can upper bound the expression above by
$$\mathbb{E}_v\left[\mathbf{1}_{v\neq 0}\left(\frac{1/2}{1/4}\right)^{|v|}\left(\frac{3/4}{1/2}\right)^{|v|}\frac{b}{d}\right] \;=\; \frac{b}{d}\,\mathbb{E}_v\Big[\mathbf{1}_{v\neq 0}\,3^{|v|}\Big]. \qquad (18)$$
To compute the expectation in closed form, recall that each entry of $v$ is picked independently to be $1$ with probability $4\rho$, and $0$ otherwise. Therefore,
$$\mathbb{E}_v\Big[\mathbf{1}_{v\neq 0}\,3^{|v|}\Big] \;=\; \mathbb{E}_v\Big[3^{|v|}\Big] - \Pr(v=0) \;=\; \prod_{i=1}^n\mathbb{E}_{v_i}\big[3^{v_i}\big] - \Pr(v=0) \;=\; \big(4\rho\cdot 3 + (1-4\rho)\big)^n - (1-4\rho)^n$$
$$=\; (1+8\rho)^n - (1-4\rho)^n \;\le\; \exp(8n\rho) - (1-4n\rho),$$
where in the last inequality we used the facts that $(1+a/n)^n\le\exp(a)$ and $(1-a)^n\ge 1-an$. Since we assume $\rho\le\frac{1}{4n}$, we have $8n\rho\le 2$, so we can use the inequality $\exp(x)\le 1+3.25x$, which holds for any $x\in[0,2]$, and get that the expression above is at most $(1+26n\rho)-(1-4n\rho)=30n\rho$, and therefore Eq. (18) is at most $30\,n\rho\, b/d$. This in turn is an upper bound on Eq. (14), as required.
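The closed-form expectation computed at the end of the proof can be checked numerically. The sketch below evaluates $\mathbb{E}[\mathbf{1}_{v\neq 0}3^{|v|}]$ from the binomial law of $|v|$ and compares it against the intermediate and final bounds; the specific values $n=50$ and $\rho=\frac{1}{4n}$ are arbitrary choices for the illustration.
\begin{verbatim}
from math import comb, exp

n = 50
rho = 1.0 / (4 * n)                  # at the boundary of the assumption rho <= 1/(4n)
p = 4 * rho                          # probability that an entry of v equals 1

# E[3^|v|] with |v| ~ Binomial(n, p), evaluated directly from the pmf:
expectation = sum(comb(n, k) * p**k * (1 - p)**(n - k) * 3**k for k in range(n + 1))
assert abs(expectation - (1 + 8 * rho) ** n) < 1e-9

lhs = (1 + 8 * rho) ** n - (1 - 4 * rho) ** n        # E[1_{v != 0} 3^|v|]
assert lhs <= exp(8 * n * rho) - (1 - 4 * n * rho)   # intermediate inequality
assert lhs <= 30 * n * rho                           # final bound used above
print(f"E[1(v != 0) 3^|v|] = {lhs:.4f}   <=   30 n rho = {30 * n * rho:.4f}")
\end{verbatim}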
A.4 Proof of Thm. 4

Let $c_1,c_2$ be positive parameters to be determined later, and assume by contradiction that our algorithm can guarantee $\mathbb{E}\big[\sum_{t=1}^T\ell_{t,i_t} - \sum_{t=1}^T\ell_{t,j}\big] < c_1\min\{T/4,\sqrt{dT/b}\}$ for any distribution and all $j$.

Consider the set of distributions $\Pr_j(\cdot)$ over $\{0,1\}^d$, where each coordinate is chosen independently and uniformly, except coordinate $j$, which equals $0$ with probability $\frac{1}{2}+\rho$, where $\rho = c_2\min\{1/4,\sqrt{d/(bT)}\}$. Clearly, the coordinate $i$ which minimizes $\mathbb{E}[\ell_{t,i}]$ is $j$. Moreover, if at round $t$ the learner chooses some $i_t\neq j$, then $\mathbb{E}[\ell_{t,i_t}-\ell_{t,j}] = \rho = c_2\min\{1/4,\sqrt{d/(bT)}\}$. Thus, to have $\mathbb{E}\big[\sum_{t=1}^T\ell_{t,i_t} - \sum_{t=1}^T\ell_{t,j}\big] < c_1\min\{T/4,\sqrt{dT/b}\} = (c_1/c_2)\,T\rho$, the expected number of rounds on which $i_t\neq j$ must be less than $(c_1/c_2)T$, so by Markov's inequality, $\Pr\big(|\{t: i_t\neq j\}| \ge T/2\big) \le 2c_1/c_2$. In other words, if we can guarantee regret smaller than $c_1\min\{T/4,\sqrt{dT/b}\}$, then we can detect $j$ with probability at least $1-2c_1/c_2$, simply by taking the most common coordinate.

However, by Thm. 2, for any $(b,1,T)$ protocol, there is some $j$ such that the protocol would correctly detect $j$ with probability at most
$$\frac{3}{d} + 21\sqrt{T\,\frac{b}{d}\,c_2^2\min\left\{\frac{1}{16},\,\frac{d}{bT}\right\}} \;\le\; \frac{3}{d} + 21c_2.$$
Therefore, assuming $d\ge 10$, and taking for instance $c_1 = 3.5\cdot 10^{-4}$, $c_2 = 5.5\cdot 10^{-3}$, we get that the probability of detection is at most $0.3 + 21c_2 < 0.45$, whereas the scheme discussed in the previous paragraph guarantees detection with probability at least $1-2c_1/c_2 > 0.85$. We have reached a contradiction, hence our initial hypothesis is false and our algorithm must suffer regret at least $c_1\min\{T/4,\sqrt{dT/b}\}$.

(The theorem discusses the case where the distribution is over $\{-1,+1\}^d$ and coordinate $j$ has a slight positive bias, but it is easily seen that the lower bound also holds here, where the domain is $\{0,1\}^d$.)
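The contradiction at the end of the proof of Thm. 4 is a purely numerical statement. The following check uses the constants as reconstructed above ($c_1 = 3.5\cdot 10^{-4}$, $c_2 = 5.5\cdot 10^{-3}$, $d\ge 10$); these particular values are assumptions of this sketch rather than values quoted from the original.
\begin{verbatim}
c1, c2, d = 3.5e-4, 5.5e-3, 10       # assumed constants for this sketch

upper = 3 / d + 21 * c2              # detection probability allowed by Thm. 2
lower = 1 - 2 * c1 / c2              # detection probability implied by low regret
print(f"Thm. 2 bound: {upper:.3f}    regret-based detector: {lower:.3f}")
assert upper < 0.45 < 0.85 < lower   # the two requirements are incompatible
\end{verbatim}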
A.5 Proof of Thm. 5

The proof is rather involved, and is composed of several stages. First, we define a variant of our hide-and-seek problem, which depends on sparse distributions. We then prove an information-theoretic lower bound on the achievable performance for this hide-and-seek problem with information constraints. The bound is similar to Thm. 3, but without an explicit dependence on the bias $\rho$ (attaining a dependence on $\rho$ seems technically complex for this hide-and-seek problem, but fortunately it is not needed to prove Thm. 5). We then show how the lower bound can be strengthened in the specific case of $b$-memory online protocols. Finally, we use these ingredients to prove Thm. 5.

We begin by defining the following hide-and-seek problem, which differs from the problem in Definition 2 in that the distribution is supported on sparse instances. It is again parameterized by a dimension $d$, bias $\rho$, and sample size $mn$:

Definition 3 (Hide-and-Seek Problem 2). Consider the set of distributions $\{\Pr_j(\cdot)\}_{j=1}^d$ over $\{-e_i,+e_i\}_{i=1}^d$, defined as
$$\Pr_j(e_i) = \begin{cases}\frac{1}{2d} & i\neq j\\[2pt] \frac{1/2+\rho}{d} & i=j\end{cases}\qquad\qquad \Pr_j(-e_i) = \begin{cases}\frac{1}{2d} & i\neq j\\[2pt] \frac{1/2-\rho}{d} & i=j.\end{cases}$$
Given an i.i.d. sample of $mn$ instances generated from $\Pr_j(\cdot)$, where $j$ is unknown, detect $j$.

In words, $\Pr_j(\cdot)$ corresponds to picking $\pm e_i$, where $i$ is chosen uniformly at random, and the sign is chosen uniformly if $i\neq j$, and positive (resp. negative) with probability $\frac{1}{2}+\rho$ (resp. $\frac{1}{2}-\rho$) if $i=j$. It is easily verified that this creates sparse instances with zero-mean coordinates, except coordinate $j$, whose expectation is $2\rho/d$.

We now present a result similar to Thm. 3 for this new hide-and-seek problem:

Theorem 6. Consider hide-and-seek problem 2 on $d>1$ coordinates, with some bias $\rho \le \min\big\{\frac{1}{2},\,\frac{19\log(d)}{n},\,\frac{d}{n}\big\}$. Then for any estimate $\tilde J$ of the biased coordinate returned by any $(b,n,m)$ protocol, there exists some coordinate $j$ such that
$$\Pr_j(\tilde J = j) \;\le\; \frac{3}{d} + 11\sqrt{\frac{mb}{d}}.$$

The proof appears in subsection A.6 below, and is broadly similar to the proof of Thm. 3 (although it uses a somewhat different approach).

The theorems above hold for any $(b,n,m)$ protocol, and in particular for $b$-memory online protocols (since these are a special case of $(b,1,m)$ protocols). However, for $b$-memory online protocols, the following simple observation allows us to further strengthen our results:

Theorem 7. Any $b$-memory online protocol over $m$ instances is also a $\big(b,\kappa,\lfloor m/\kappa\rfloor\big)$ protocol, for any positive integer $\kappa\le m$.

The proof is immediate: given a batch of $\kappa$ instances, we can always feed the instances one by one to our $b$-memory online protocol, and output the final message after $\lfloor m/\kappa\rfloor$ such batches have been processed, ignoring any remaining instances. This makes the algorithm a type of $\big(b,\kappa,\lfloor m/\kappa\rfloor\big)$ protocol. As a result, when discussing $b$-memory online protocols for some particular value of $m$, we can actually apply Thm. 6 with $n,m$ replaced by $\kappa,\lfloor m/\kappa\rfloor$, where $\kappa$ is a free parameter we can tune to attain the most convenient bound.

With these results at hand, we turn to prove Thm. 5. The lower bound follows from the concentration of measure assumption on $\widetilde{x_ix_j}$ and a union bound, which imply that
$$\Pr\Big(\forall\, i<j,\;\big|\widetilde{x_ix_j}-\mathbb{E}[x_ix_j]\big| < \tau\Big) \;\ge\; 1 - \frac{d(d-1)}{2}\exp\big(-m\tau^2/2\big) \;\ge\; 1 - d^2\exp\big(-m\tau^2/2\big).$$
If this event occurs, then picking $(\tilde I,\tilde J)$ to be the coordinates with the largest empirical mean would indeed succeed in detecting $(i^*,j^*)$, since $\mathbb{E}[x_{i^*}x_{j^*}] \ge \mathbb{E}[x_ix_j] + 2\tau$ for all $(i,j)\neq(i^*,j^*)$.

The upper bound in the theorem statement follows from a reduction to the setting discussed in Thm. 6. Let $\{\Pr_{i^*,j^*}(\cdot)\}_{1\le i^*<j^*\le d}$ denote the distributions used in the theorem's construction; they can be viewed as an instance of hide-and-seek problem 2 whose $d(d-1)/2$ coordinates are indexed by the pairs $(i^*,j^*)$. Applying Thm. 6 with $d$ replaced by $d(d-1)/2$, we get that if $\rho \le \min\big\{\frac{1}{2},\,\frac{19\log(d(d-1))}{n},\,\frac{d(d-1)}{n}\big\}$, then for some $(i^*,j^*)$ and any estimator $(\tilde I,\tilde J)$ returned by a $(b,n,m)$ protocol,
$$\Pr_{i^*,j^*}\big((\tilde I,\tilde J)=(i^*,j^*)\big) \;\le\; \frac{6}{d(d-1)} + 11\sqrt{\frac{2mb}{d(d-1)}}.$$
Our theorem deals with two types of protocols: $\big(b,\frac{d(d-1)}{2},\big\lfloor\frac{2m}{d(d-1)}\big\rfloor\big)$ protocols, and $b$-memory online protocols over $m$ instances. In the former case, we can simply plug in $\big\lfloor\frac{2m}{d(d-1)}\big\rfloor,\frac{d(d-1)}{2}$ instead of $m,n$, while in the latter case we can still replace $m,n$ by $\big\lfloor\frac{2m}{d(d-1)}\big\rfloor,\frac{d(d-1)}{2}$ thanks to Thm. 7. In both cases, doing this replacement and choosing $\rho = \frac{19\log(d(d-1))}{d(d-1)}$ (which is justified when $d$ is larger than some absolute constant, as we assume), we get that
$$\Pr_{i^*,j^*}\big((\tilde I,\tilde J)=(i^*,j^*)\big) \;\le\; \frac{6}{d(d-1)} + 11\sqrt{\frac{2b}{d(d-1)}\left\lfloor\frac{2m}{d(d-1)}\right\rfloor} \;\le\; O\left(\frac{1}{d^2} + \sqrt{\frac{mb}{d^4}}\right). \qquad (19)$$
This implies the upper bound stated in the theorem; it also determines the size of the gap, since
$$\tau \;=\; \frac{2\rho}{d-1} \;=\; \frac{38\log(d(d-1))}{d(d-1)^2} \;=\; \Theta\left(\frac{\log(d)}{d^3}\right).$$
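To illustrate the unconstrained detector used in the lower-bound argument above, the following small simulation draws $m$ instances of the form $\sqrt{d/2}\,(\sigma_1 e_i+\sigma_2 e_j)$ and returns the pair with the largest empirical second moment. The sign-correlation model for the planted pair (signs agreeing with probability $\frac{1}{2}+\rho$) follows the reconstruction above and should be read as an assumption of this sketch.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d, m, rho = 8, 20000, 0.4
planted = (1, 4)                               # the correlated pair (i*, j*)
pairs = [(i, j) for i in range(d) for j in range(i + 1, d)]

X = np.zeros((m, d))
for t in range(m):
    i, j = pairs[rng.integers(len(pairs))]     # uniformly random pair of coordinates
    s1 = rng.choice([-1.0, 1.0])
    if (i, j) == planted:                      # signs agree with probability 1/2 + rho
        s2 = s1 if rng.random() < 0.5 + rho else -s1
    else:
        s2 = rng.choice([-1.0, 1.0])
    X[t, i], X[t, j] = np.sqrt(d / 2) * s1, np.sqrt(d / 2) * s2

moments = X.T @ X / m                          # empirical second moments
np.fill_diagonal(moments, -np.inf)             # ignore the diagonal
gi, gj = np.unravel_index(np.argmax(moments), moments.shape)
print("planted pair:", planted, "  detected pair:", tuple(sorted((int(gi), int(gj)))))
\end{verbatim}
With these (arbitrary) parameters the planted pair is recovered essentially every time, matching the concentration argument; the point of Thm. 5 is that this is no longer possible once the $b$-bit information constraints are imposed.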
Having finished with the proof of the theorem as stated, we note that it is possible to extend the construction used here to show performance gaps for other sample sizes $m$. For example, instead of using a distribution supported on $\big\{\sqrt{d/2}\,(\sigma_1 e_i + \sigma_2 e_j)\big\}_{1\le i<j\le d}$ ...

A.6 Proof of Thm. 6

Since we assume $\rho \le \min\big\{\frac{1}{2},\,\frac{19\log(d)}{n},\,\frac{d}{n}\big\}$, it holds that the expression above is at most $30\,b/d$. To summarize, this is a valid upper bound on Eq. (22), i.e. on any individual term inside the expectation in Eq. (21). Thus, we can upper bound Eq. (21) by $30\,mb/d$. This shows that Eq. (20) indeed holds, which as explained earlier implies the required result.

B Basic Results in Information Theory

The proofs of Thm. 3 and Thm. 6 make extensive use of quantities and basic results from information theory. We briefly review here the technical results relevant for our paper; a more complete introduction may be found in [26]. Following the settings considered in the paper, we will focus only on discrete distributions taking values in a finite set.

Given a random variable $X$ taking values in a domain $\mathcal{X}$ and having a distribution function $p(\cdot)$, we define its entropy as
$$H(X) = \sum_{x\in\mathcal{X}} p(x)\log\big(1/p(x)\big) = \mathbb{E}_X\left[\log\frac{1}{p(X)}\right].$$
Intuitively, this quantity measures the uncertainty in the value of $X$. This definition can be extended to the joint entropy of two (or more) random variables, e.g. $H(X,Y)=\sum_{x,y}p(x,y)\log(1/p(x,y))$, and to the conditional entropy
$$H(X\mid Y) = \sum_y p(y)\sum_x p(x\mid y)\log\frac{1}{p(x\mid y)}.$$
In particular, for a particular value $y$ of $Y$, we have $H(X\mid Y=y)=\sum_x p(x\mid y)\log\frac{1}{p(x\mid y)}$.

It is possible to show that $\sum_{i=1}^n H(X_i) \ge H(X_1,\ldots,X_n)$, with equality when $X_1,\ldots,X_n$ are independent. Also, $H(X)\ge H(X\mid Y)$ (i.e. conditioning can only reduce entropy). Finally, if $X$ is supported on a discrete set of size $2^b$, then $H(X)$ is at most $b$.

The mutual information $I(X;Y)$ between two random variables $X,Y$ is defined as
$$I(X;Y) = H(X)-H(X\mid Y) = H(Y)-H(Y\mid X) = \sum_{x,y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}.$$
Intuitively, this measures the amount of information each variable carries about the other one, or in other words, the reduction in uncertainty about one variable given that we know the other. Since conditional entropy is non-negative, we immediately get $I(X;Y)\le\min\{H(X),H(Y)\}$. As for entropy, one can define the conditional mutual information between random variables $X,Y$ given some other random variable $Z$ as
$$I(X;Y\mid Z) = \mathbb{E}_{z\sim Z}\big[I(X;Y\mid Z=z)\big] = \sum_z p(z)\sum_{x,y}p(x,y\mid z)\log\frac{p(x,y\mid z)}{p(x\mid z)p(y\mid z)}.$$
Finally, we define the relative entropy (or Kullback-Leibler divergence) between two distributions $p,q$ on the same set as
$$D_{kl}(p\,\|\,q) = \sum_x p(x)\log\frac{p(x)}{q(x)}.$$
It is possible to show that the relative entropy is always non-negative, and jointly convex in its two arguments (viewed as vectors in the simplex). It also satisfies the following chain rule:
$$D_{kl}\big(p(x_1\ldots x_n)\,\|\,q(x_1\ldots x_n)\big) = \sum_{i=1}^n \mathbb{E}_{x_1\ldots x_{i-1}\sim p}\Big[D_{kl}\big(p(x_i\mid x_1\ldots x_{i-1})\,\|\,q(x_i\mid x_1\ldots x_{i-1})\big)\Big].$$
Also, it is easily verified that
$$I(X;Y) = \sum_y p(y)\,D_{kl}\big(p_X(\cdot\mid y)\,\|\,p_X(\cdot)\big),$$
where $p_X$ is the distribution of the random variable $X$.
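The identities above are easy to confirm numerically on a small discrete joint distribution; the following sketch checks that $I(X;Y)=H(X)-H(X\mid Y)=\sum_y p(y)\,D_{kl}(p_X(\cdot\mid y)\,\|\,p_X(\cdot))$ on an arbitrary random joint table (natural logarithms).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
pxy = rng.random((3, 4)) + 0.05      # arbitrary strictly positive joint table
pxy /= pxy.sum()                     # joint distribution p(x, y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

def H(p):                            # entropy, natural logarithms
    return -np.sum(p * np.log(p))

def KL(p, q):
    return np.sum(p * np.log(p / q))

H_x_given_y = sum(py[y] * H(pxy[:, y] / py[y]) for y in range(4))
mi_entropy = H(px) - H_x_given_y
mi_kl = sum(py[y] * KL(pxy[:, y] / py[y], px) for y in range(4))
assert np.isclose(mi_entropy, mi_kl)
print(f"I(X;Y) = {mi_entropy:.6f}  (both formulas agree)")
\end{verbatim}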
In addition, we will make use of Pinsker's inequality, which upper bounds the so-called total variation distance between two distributions $p,q$ in terms of the relative entropy between them:
$$\sum_x |p(x)-q(x)| \;\le\; \sqrt{2\,D_{kl}(p\,\|\,q)}.$$
Finally, an important inequality we use in the context of relative entropy calculations is the log-sum inequality. It states that for any non-negative $a_i,b_i$,
$$\left(\sum_i a_i\right)\log\frac{\sum_i a_i}{\sum_i b_i} \;\le\; \sum_i a_i\log\frac{a_i}{b_i}.$$
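Both inequalities are straightforward to check numerically on random non-negative vectors, for example:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
p = rng.random(6) + 0.05; p /= p.sum()
q = rng.random(6) + 0.05; q /= q.sum()
kl = np.sum(p * np.log(p / q))
assert np.sum(np.abs(p - q)) <= np.sqrt(2 * kl)       # Pinsker's inequality

a, b = rng.random(6) + 0.05, rng.random(6) + 0.05
assert a.sum() * np.log(a.sum() / b.sum()) <= np.sum(a * np.log(a / b)) + 1e-12
print("Pinsker and log-sum inequalities hold on this example.")
\end{verbatim}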