Concentration Bounds for High Sensitivity Functions Through Differential Privacy
Kobbi Nissim†   Uri Stemmer‡

March 7, 2017
Abstract
A new line of work [6, 9, 15, 2] demonstrates how differential privacy [8] can be used as a mathematical tool for guaranteeing generalization in adaptive data analysis. Specifically, if a differentially private analysis is applied on a sample $S$ of i.i.d. examples to select a low-sensitivity function $f$, then w.h.p. $f(S)$ is close to its expectation, although $f$ is being chosen based on the data.

Very recently, Steinke and Ullman [16] observed that these generalization guarantees can be used for proving concentration bounds in the non-adaptive setting, where the low-sensitivity function is fixed beforehand. In particular, they obtain alternative proofs for classical concentration bounds for low-sensitivity functions, such as the Chernoff bound and McDiarmid's Inequality.

In this work, we set out to examine the situation for functions with high sensitivity, for which differential privacy does not imply generalization guarantees under adaptive analysis. We show that differential privacy can be used to prove concentration bounds for such functions in the non-adaptive setting.

Keywords: Differential privacy, concentration bounds, high sensitivity functions

* Research by K.N. and U.S. is supported by NSF grant No. 1565387.
† Dept. of Computer Science, Georgetown University and Center for Research on Computation and Society (CRCS), Harvard University. [email protected].
‡ Center for Research on Computation and Society (CRCS), Harvard University. [email protected].

1 Introduction
A new line of work [6, 9, 15, 2] demonstrates how differential privacy [8] can be used as a mathematical tool for guaranteeing statistical validity in data analysis. Specifically, if a differentially private analysis is applied on a sample $S$ of i.i.d. examples to select a low-sensitivity function $f$, then w.h.p. $f(S)$ is close to its expectation, even when $f$ is being chosen based on the data. Dwork et al. [6] showed how to utilize this connection for the task of answering adaptively chosen queries w.r.t. an unknown distribution using i.i.d. samples from it.

To make the setting concrete, consider a data analyst interested in learning properties of an unknown distribution $\mathcal{D}$. The analyst interacts with the distribution $\mathcal{D}$ via a data curator $\mathcal{A}$ holding a database $S$ containing $n$ i.i.d. samples from $\mathcal{D}$. The interaction is adaptive, where at every round the analyst specifies a query $q : X^n \to \mathbb{R}$ and receives an answer $a_q(S)$ that (hopefully) approximates $q(\mathcal{D}^n) \triangleq \mathbb{E}_{S' \sim \mathcal{D}^n}[q(S')]$. As the analyst chooses its queries based on previous interactions with the data, we run the risk of overfitting if $\mathcal{A}$ simply answers every query with its empirical value on the sample $S$. However, if $\mathcal{A}$ is a differentially private algorithm then the interaction would not lead to overfitting:

Theorem 1.1 ([6, 2], informal). A function $f : X^n \to \mathbb{R}$ has sensitivity $\lambda$ if $|f(S) - f(S')| \le \lambda$ for every pair $S, S' \in X^n$ differing in only one entry. Define $f(\mathcal{D}^n) \triangleq \mathbb{E}_{S' \sim \mathcal{D}^n}[f(S')]$. Let $\mathcal{A} : X^n \to F_\lambda$ be $(\varepsilon, \delta)$-differentially private, where $F_\lambda$ is the class of $\lambda$-sensitive functions, and let $n \ge \frac{1}{\varepsilon^2}\log\left(\frac{\varepsilon}{\delta}\right)$. Then for every distribution $\mathcal{D}$ on $X$,
$$\Pr_{\substack{S \sim \mathcal{D}^n \\ f \leftarrow \mathcal{A}(S)}}\left[|f(S) - f(\mathcal{D}^n)| \ge \varepsilon \lambda n\right] < \frac{\delta}{\varepsilon}.$$

In words, if $\mathcal{A}$ is a differentially private algorithm operating on a database containing $n$ i.i.d. samples from the distribution $\mathcal{D}$, then $\mathcal{A}$ cannot (with significant probability) identify a low-sensitivity function that behaves differently on the sample $S$ and on $\mathcal{D}^n$.

Very recently, Steinke and Ullman [16] observed that Theorem 1.1 gives alternative proofs for classical concentration bounds for low-sensitivity functions, such as the Chernoff bound and McDiarmid's Inequality: Fix a function $f : X^n \to \mathbb{R}$ with sensitivity $\lambda$ and consider the trivial mechanism $\mathcal{A}_f$ that ignores its input and always outputs $f$. Such a mechanism is $(\varepsilon, \delta)$-differentially private for every choice of $\varepsilon, \delta \ge 0$. Hence,
$$\Pr_{S \sim \mathcal{D}^n}\left[|f(S) - f(\mathcal{D}^n)| \ge \varepsilon \lambda n\right] < \frac{\delta}{\varepsilon} = 2^{-\Omega(\varepsilon^2 n)}, \quad (1)$$
where the last equality follows by setting $n = \frac{1}{\varepsilon^2}\log\left(\frac{\varepsilon}{\delta}\right)$.

In light of this result it is natural to ask if similar techniques yield concentration bounds for more general families of queries, and in particular queries that are not low-sensitivity functions. In this work we derive conditions under which this is the case.

1.1 Differential Privacy, Max-Information, and Typical Stability

Let $\mathcal{D}$ be a fixed distribution over a domain $X$, and consider a family of functions mapping databases in $X^n$ to the reals, such that for every function $f$ in the family we have that $|f(S) - f(\mathcal{D}^n)|$ is small w.h.p. over $S \sim \mathcal{D}^n$. Specifically,
$$F_{\alpha,\beta}(\mathcal{D}) = \left\{ f : X^n \to \mathbb{R} \;:\; \Pr_{S \sim \mathcal{D}^n}\left[|f(S) - f(\mathcal{D}^n)| > \alpha\right] \le \beta \right\}.$$
That is, for every $f \in F_{\alpha,\beta}(\mathcal{D})$ its empirical value over a sample $S \sim \mathcal{D}^n$ is $\alpha$-close to its expected value w.p. $1 - \beta$. Now consider a differentially private algorithm $\mathcal{A} : X^n \to F_{\alpha,\beta}(\mathcal{D})$ that takes a database and returns a function from $F_{\alpha,\beta}(\mathcal{D})$. What can we say about the difference $|f(S) - f(\mathcal{D}^n)|$ when $f$ is chosen by $\mathcal{A}(S)$ based on the sample $S$ itself?

Using the notion of max-information, Dwork et al. [5] showed that if $\beta$ is small enough, then w.h.p. the difference remains small. Informally, they showed that if $\mathcal{A}$ is differentially private, then
$$\Pr_{\substack{S \sim \mathcal{D}^n \\ f \leftarrow \mathcal{A}(S)}}\left[|f(S) - f(\mathcal{D}^n)| > \alpha\right] \le \beta \cdot e^{\varepsilon n}.$$
So, if $\mathcal{A}$ is a differentially private algorithm that ranges over functions which are very concentrated around their expected value (i.e., $\beta \ll e^{-\varepsilon n}$), then $|f(S) - f(\mathcal{D}^n)|$ remains small (w.h.p.) even when $f$ is chosen by $\mathcal{A}(S)$ based on the sample $S$. When $\beta \gg e^{-\varepsilon n}$ it is easy to construct examples where a differentially private algorithm identifies a function $f \in F_{\alpha,\beta}(\mathcal{D})$ such that $|f(S) - f(\mathcal{D}^n)|$ is arbitrarily large with high probability. So, in general, differential privacy does not guarantee generalization for adaptively chosen functions of this sort. However, a stronger notion than differential privacy – typical stability – presented by Bassily and Freund [1] does guarantee generalization in this setting. (A similar notion – perfect generalization – was presented in [4].) Informally, they showed that if a typically stable algorithm $\mathcal{B}$ outputs a function $f \in F_{\alpha,\beta}(\mathcal{D})$, then $|f(S) - f(\mathcal{D}^n)|$ remains small.

The results of this article provide another piece of this puzzle, as we show that (a variant of) differential privacy can in some cases be used to prove that a function $f$ is in $F_{\alpha,\beta}(\mathcal{D})$.

Notation.
Throughout this article we use the convention that $f(\mathcal{D}^n)$ is the expected value of the function $f$ over a sample containing $n$ i.i.d. elements drawn according to the distribution $\mathcal{D}$. That is, $f(\mathcal{D}^n) \triangleq \mathbb{E}_{S \sim \mathcal{D}^n}[f(S)]$.

Fix a function $f : X^n \to \mathbb{R}$, let $\mathcal{D}$ be a distribution over $X$, and let $S \sim \mathcal{D}^n$. Our goal is to bound the probability that $|f(S) - f(\mathcal{D}^n)|$ is large by some (hopefully) easy-to-analyze quantity. To intuit our result, consider for example what we get by a simple application of Markov's Inequality:
$$\Pr_{S \sim \mathcal{D}^n}\left[|f(S) - f(\mathcal{D}^n)| > \lambda\right] \le \frac{1}{\lambda} \cdot \mathbb{E}_{S \sim \mathcal{D}^n}\left[\mathbb{1}\{|f(S) - f(\mathcal{D}^n)| > \lambda\} \cdot |f(S) - f(\mathcal{D}^n)|\right]. \quad (2)$$
We show that using differential privacy we can replace the term $|f(S) - f(\mathcal{D}^n)|$ in the expectation with $|f(S \cup \{x\}) - f(S \cup \{y\})|$, which can sometimes be easier to analyze. Specifically, we show the following.

Theorem 1.2 (part 1). Let $\mathcal{D}$ be a distribution over a domain $X$, let $f : X^n \to \mathbb{R}$, and let $\Delta, \lambda \in \mathbb{R}_{\ge 0}$ be s.t. for every $1 \le i \le n$ it holds that
$$\mathbb{E}_{\substack{S \sim \mathcal{D}^n \\ z \sim \mathcal{D}}}\left[\mathbb{1}\left\{\left|f(S) - f\left(S^{(i \leftarrow z)}\right)\right| > \lambda\right\} \cdot \left|f(S) - f\left(S^{(i \leftarrow z)}\right)\right|\right] \le \Delta, \quad (3)$$
where $S^{(i \leftarrow z)}$ is the same as $S$ except that the $i$th element is replaced with $z$. Then for every $\varepsilon > 0$ we have that $\Pr_{S \sim \mathcal{D}^n}\left[|f(S) - f(\mathcal{D}^n)| \ge \varepsilon \lambda n\right] < \frac{2\Delta}{\varepsilon\lambda}$, provided that $n \ge O\left(\frac{1}{\varepsilon \cdot \min\{1,\varepsilon\}} \log\left(\frac{\lambda \cdot \min\{1,\varepsilon\}}{\Delta}\right)\right)$.

For a $\lambda$-sensitive function $f$, the expectation in Equation (3) is zero, so the statement holds for every choice of $\Delta > 0$ provided that $n \ge O\left(\frac{1}{\varepsilon^2} \log\left(\frac{\lambda}{\Delta}\right)\right)$, which recovers McDiarmid's Inequality (Equation (1)). Intuitively, Theorem 1.2 states that in order to obtain a high probability bound on $|f(S) - f(\mathcal{D}^n)|$ it suffices to analyze the "expectation of the tail" of $\left|f(S) - f\left(S^{(i \leftarrow z)}\right)\right|$, as a function of the starting point $\lambda$.

We also show that the above bound can be improved whenever the "expectation of the head" of $\left|f(S) - f\left(S^{(i \leftarrow z)}\right)\right|$ is smaller than $\lambda$. Specifically,

Theorem 1.2 (part 2). If, in addition to (3), there exists $\tau \le \lambda$ s.t. for every $S \in X^n$ and every $1 \le i \le n$ we have
$$\mathbb{E}_{y,z \sim \mathcal{D}}\left[\mathbb{1}\left\{\left|f(S^{(i \leftarrow y)}) - f\left(S^{(i \leftarrow z)}\right)\right| \le \lambda\right\} \cdot \left|f(S^{(i \leftarrow y)}) - f\left(S^{(i \leftarrow z)}\right)\right|\right] \le \tau, \quad (4)$$
then for every $\varepsilon > 0$ we have that $\Pr_{S \sim \mathcal{D}^n}\left[|f(S) - f(\mathcal{D}^n)| \ge \varepsilon \tau n\right] < \frac{2\Delta}{\varepsilon\tau}$, provided that $n \ge O\left(\frac{\lambda}{\varepsilon \cdot \min\{1,\varepsilon\} \cdot \tau} \log\left(\frac{\tau \cdot \min\{1,\varepsilon\}}{\Delta}\right)\right)$.

Observe that while the expectation in (3) is over the entire sample $S$ (as well as the replacement point), in requirement (4) the sample $S$ is fixed. We do not know if this "worst-case" restriction is necessary.

In Section 4 we demonstrate how Theorem 1.2 can be used in proving a variety of concentration bounds, such as a high probability bound on $|f(S) - f(\mathcal{D}^n)|$ for Lipschitz functions. In addition we show that Theorem 1.2 can be used to bound the probability that the number of triangles in a random graph significantly exceeds the expectation.
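To make the quantity in Equation (3) concrete, here is a minimal Monte-Carlo sketch of the truncated tail expectation for the sample-sum function (Python with NumPy assumed; the choice of $f$, the distribution $\mathcal{D}$, and the truncation point $\lambda$ below are illustrative only, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

def tail_expectation(f, sample, replace, i, lam, trials=20_000):
    """Monte-Carlo estimate of the left-hand side of Equation (3):
    E[ 1{|f(S) - f(S^(i<-z))| > lam} * |f(S) - f(S^(i<-z))| ]."""
    total = 0.0
    for _ in range(trials):
        S = sample()              # S ~ D^n
        S_rep = S.copy()
        S_rep[i] = replace()      # replace the i-th element with z ~ D
        diff = abs(f(S) - f(S_rep))
        if diff > lam:            # only the tail beyond lam contributes
            total += diff
    return total / trials

n, lam = 100, 3.0
delta_hat = tail_expectation(
    f=np.sum,                                  # the sample-sum function
    sample=lambda: rng.standard_normal(n),     # illustrative D: standard normal
    replace=lambda: rng.standard_normal(),
    i=0,
    lam=lam,
)
print(f"estimated Delta at lambda={lam}: {delta_hat:.5f}")
```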
2 Preliminaries

2.1 Differential Privacy

Our results rely on a number of basic facts about differential privacy. An algorithm operating on databases is said to preserve differential privacy if a change of a single record of the database does not significantly change the output distribution of the algorithm. Formally:

Definition 2.1.
Databases $S \in X^n$ and $S' \in X^n$ over a domain $X$ are called neighboring if they differ in exactly one entry.

Definition 2.2 (Differential Privacy [8, 7]). A randomized algorithm $\mathcal{A} : X^n \to Y$ is $(\varepsilon, \delta)$-differentially private if for all neighboring databases $S, S' \in X^n$, and for every set of outputs $T \subseteq Y$, we have
$$\Pr[\mathcal{A}(S) \in T] \le e^{\varepsilon} \cdot \Pr[\mathcal{A}(S') \in T] + \delta.$$
The probability is taken over the random coins of $\mathcal{A}$.

2.2 The Exponential Mechanism

We next describe the exponential mechanism of McSherry and Talwar [14].
Definition 2.3 (Sensitivity). The sensitivity (or global sensitivity) of a function $f : X^n \to \mathbb{R}$ is the smallest $\lambda$ such that for every neighboring $S, S' \in X^n$, we have $|f(S) - f(S')| \le \lambda$. We use the term "$\lambda$-sensitive function" to mean a function of sensitivity $\le \lambda$.

Let $X$ be a domain and $H$ a set of solutions. Given a database $S \in X^*$, the exponential mechanism privately chooses a "good" solution $h$ out of the possible set of solutions $H$. This "goodness" is quantified using a quality function that matches solutions to scores.

Definition 2.4 (Quality function). A quality function is a function $q : X^* \times H \to \mathbb{R}$ that maps a database $S \in X^*$ and a solution $h \in H$ to a real number, identified as the score of the solution $h$ w.r.t. the database $S$.

Given a quality function $q$ and a database $S$, the goal is to choose a solution $h$ approximately maximizing $q(S, h)$. The exponential mechanism chooses a solution probabilistically, where the probability mass that is assigned to each solution $h$ increases exponentially with its quality $q(S, h)$:

The Exponential Mechanism

Input: privacy parameter $\varepsilon > 0$, finite solution set $H$, database $S \in X^n$, and a $\lambda$-sensitive quality function $q$.

1. Randomly choose $h \in H$ with probability $\dfrac{\exp\left(\frac{\varepsilon}{2\lambda} \cdot q(S,h)\right)}{\sum_{h' \in H} \exp\left(\frac{\varepsilon}{2\lambda} \cdot q(S,h')\right)}$.
2. Output h . Theorem 2.5 (Properties of the exponential mechanism) . (i) The exponential mechanism is ( ε, -di ff erentially private. (ii) Let Opt ( S ) (cid:44) max f ∈ H { q ( S, f ) } and ∆ > . The exponential mechanism outputsa solution h such that q ( S, h ) ≤ ( Opt ( S ) − ∆ ) with probability at most | H | · exp (cid:16) − ε ∆ λ (cid:17) . Let X , . . . , X n be independent random variables where Pr[ X i = 1] = p and Pr[ X i = 0] = 1 − p forsome 0 < p <
2.3 Concentration Bounds

Let $X_1, \ldots, X_n$ be independent random variables where $\Pr[X_i = 1] = p$ and $\Pr[X_i = 0] = 1 - p$ for some $0 < p < 1$. Clearly, $\mathbb{E}\left[\sum_{i=1}^n X_i\right] = pn$. Chernoff and Hoeffding bounds show that the sum is concentrated around this expected value:
$$\Pr\left[\sum_{i=1}^n X_i > (1+\delta)pn\right] \le \exp\left(-pn\delta^2/3\right) \quad \text{for } 0 < \delta \le 1,$$
$$\Pr\left[\sum_{i=1}^n X_i < (1-\delta)pn\right] \le \exp\left(-pn\delta^2/2\right) \quad \text{for } 0 < \delta < 1,$$
$$\Pr\left[\left|\sum_{i=1}^n X_i - pn\right| > \delta\right] \le 2\exp\left(-2\delta^2/n\right) \quad \text{for } \delta \ge 0.$$
The first two inequalities are known as the multiplicative Chernoff bounds [3], and the last inequality is known as the Hoeffding bound [10]. The next theorem states that the Chernoff bound above is tight up to constant factors in the exponent.

Theorem 2.6 (Tightness of Chernoff bound [12]). Let $0 < p, \delta \le 1/2$, and let $n \ge \frac{3}{\delta^2 p}$. Let $X_1, \ldots, X_n$ be independent random variables where $\Pr[X_i = 1] = p$ and $\Pr[X_i = 0] = 1 - p$. Then,
$$\Pr\left[\sum_{i=1}^n X_i \le (1-\delta)pn\right] \ge \exp\left(-9\delta^2 pn\right), \qquad \Pr\left[\sum_{i=1}^n X_i \ge (1+\delta)pn\right] \ge \exp\left(-9\delta^2 pn\right).$$
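As a quick numerical sanity check of the multiplicative Chernoff bound above, the following short simulation (Python/NumPy; the parameters are illustrative) compares the empirical upper tail of a sum of independent Bernoulli($p$) variables with the bound $\exp(-pn\delta^2/3)$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, delta, trials = 1000, 0.1, 0.3, 200_000
sums = rng.binomial(n, p, size=trials)            # each draw: sum of n Bernoulli(p)
empirical = np.mean(sums > (1 + delta) * p * n)   # empirical upper-tail probability
chernoff = np.exp(-p * n * delta**2 / 3)          # multiplicative Chernoff bound
print(f"empirical: {empirical:.6f}  bound: {chernoff:.6f}")
```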
3 Concentration Bounds Through Differential Privacy

In this section we show how the concept of differential privacy can be used to derive conditions under which a function $f$ and a distribution $\mathcal{D}$ satisfy that $|f(S) - f(\mathcal{D}^n)|$ is small w.h.p. when $S \sim \mathcal{D}^n$. Our proof technique builds on the proof of Bassily et al. [2] for the generalization properties of a differentially private algorithm that outputs a low-sensitivity function. The proof consists of two steps:

1. Let $S_1, \ldots, S_T$ be $T$ independent samples from $\mathcal{D}^n$ (each containing $n$ i.i.d. samples from $\mathcal{D}$). Let $\mathcal{A}$ be a selection procedure that, given $S_1, \ldots, S_T$, chooses an index $t \in [T]$ with the goal of maximizing $|f(S_t) - f(\mathcal{D}^n)|$. We show that if $\mathcal{A}$ satisfies (a variant of) differential privacy then, under some conditions on the function $f$ and the distribution $\mathcal{D}$, the expectation of $|f(S_t) - f(\mathcal{D}^n)|$ is bounded. That is, if $\mathcal{A}$ is differentially private, then its ability to identify a "bad" index $t$ with large $|f(S_t) - f(\mathcal{D}^n)|$ is limited.

2. We show that if $|f(S) - f(\mathcal{D}^n)|$ is large w.h.p. over $S \sim \mathcal{D}^n$, then it is possible to construct an algorithm $\mathcal{A}$ satisfying (a variant of) differential privacy that contradicts our expectation bound.

We begin with a few definitions.

Notation. We use $\vec{S} \in (X^n)^T$ to denote a multi-database consisting of $T$ databases of size $n$ over $X$. Given a distribution $\mathcal{D}$ over a domain $X$ we write $\vec{S} \sim \mathcal{D}^{nT}$ to denote a multi-database sampled i.i.d. from $\mathcal{D}$.

Definition 3.1.
Fix a function $f : X^n \to \mathbb{R}$ mapping databases of size $n$ over a domain $X$ to the reals. We say that two multi-databases $\vec{S} = (S_1, \ldots, S_T) \in (X^n)^T$ and $\vec{S}' = (S'_1, \ldots, S'_T) \in (X^n)^T$ are $(f, \lambda)$-neighboring if for all $1 \le i \le T$ we have that $|f(S_i) - f(S'_i)| \le \lambda$.

Definition 3.2 ($(\varepsilon, (f,\lambda))$-differential privacy). Let $M : (X^n)^T \to Y$ be a randomized algorithm that operates on $T$ databases of size $n$ from $X$. For a function $f : X^n \to \mathbb{R}$ and parameters $\varepsilon, \lambda \ge 0$, we say that $M$ is $(\varepsilon, (f,\lambda))$-differentially private if for every set of outputs $F \subseteq Y$ and for every $(f,\lambda)$-neighboring $\vec{S}, \vec{S}' \in (X^n)^T$ it holds that
$$\Pr[M(\vec{S}) \in F] \le e^{\varepsilon} \cdot \Pr[M(\vec{S}') \in F].$$

Claim 3.3. Fix a function $f : X^n \to \mathbb{R}$ and parameters $\varepsilon \le 1$ and $\lambda \ge 0$. If $M : (X^n)^T \to Y$ is $(\varepsilon, (f,\lambda))$-differentially private then for every $(f,\lambda)$-neighboring databases $\vec{S}, \vec{S}' \in (X^n)^T$ and every function $h : Y \to \mathbb{R}$ we have that
$$\mathbb{E}_{y \leftarrow M(\vec{S})}[h(y)] \le \mathbb{E}_{y \leftarrow M(\vec{S}')}[h(y)] + 4\varepsilon \cdot \mathbb{E}_{y \leftarrow M(\vec{S}')}[|h(y)|].$$

Claim 3.3 follows from basic arguments in differential privacy. The proof appears in the appendix for completeness.

The proof of Theorem 1.2 contains somewhat unwieldy notation. For readability, we present here a restricted version of the theorem, tailored to the case where the function $f$ computes the sample sum, which highlights most of the ideas in the proof. The full proof of Theorem 1.2 is included in the appendix.

Notation.
Given a sample $S \in X^n$, we use $\bar{f}(S)$ to denote the sample sum, i.e., $\bar{f}(S) = \sum_{x \in S} x$.

Lemma 3.4 (Simplified Expectation Bound). Let $\mathcal{D}$ be a distribution over a domain $X$ such that $\mathbb{E}_{x \sim \mathcal{D}}[x] = 0$ and $\mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{1}\{|x| > 1\} \cdot |x|\right] \le \Delta$. Fix $0 < \varepsilon \le 1$, and let $\mathcal{A} : (X^n)^T \to [T]$ be an $(\varepsilon, (\bar{f}, 1))$-differentially private algorithm that operates on $T$ databases of size $n$ from $X$, and outputs an index $1 \le t \le T$. Then
$$\left|\mathbb{E}_{\substack{\vec{S} \sim \mathcal{D}^{nT} \\ t \leftarrow \mathcal{A}(\vec{S})}}\left[\bar{f}(S_t)\right]\right| \le 4\varepsilon n + 2nT\Delta.$$
We denote $\vec{S} = (S_1, \ldots, S_T)$, where every $S_t$ is itself a vector $S_t = (x_{t,1}, \ldots, x_{t,n})$. We have:
$$\mathbb{E}_{\substack{\vec{S} \sim \mathcal{D}^{nT} \\ t \leftarrow \mathcal{A}(\vec{S})}}\left[\bar{f}(S_t)\right] = \sum_{i \in [n]} \mathbb{E}_{\vec{S} \sim \mathcal{D}^{nT}} \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S})}[x_{t,i}] = \sum_{i \in [n]} \mathbb{E}_{\vec{S} \sim \mathcal{D}^{nT}}\left[\mathbb{1}\left\{\max_{m \in [T]} |x_{m,i}| \le 1\right\} \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S})}[x_{t,i}] + \mathbb{1}\left\{\max_{m \in [T]} |x_{m,i}| > 1\right\} \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S})}[x_{t,i}]\right]. \quad (5)$$
In the case where $\max_{m \in [T]} |x_{m,i}| > 1$ we replace the expectation over $t \leftarrow \mathcal{A}(\vec{S})$ with the deterministic choice of the maximal $t$ (this makes the expression larger). When $\max_{m \in [T]} |x_{m,i}| \le 1$ we use the privacy of $\mathcal{A}$. Given a multi-sample $\vec{S} \in (X^n)^T$ we use $\vec{S}_{-i}$ to denote a multi-sample identical to $\vec{S}$, except that the $i$th element of every sub-sample is replaced with 0. Using Claim 3.3 we get
$$(5) \le \sum_{i \in [n]} \mathbb{E}_{\vec{S} \sim \mathcal{D}^{nT}}\left[\mathbb{1}\left\{\max_{m \in [T]} |x_{m,i}| \le 1\right\} \left(\mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S}_{-i})}[x_{t,i}] + 4\varepsilon\, \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S}_{-i})}[|x_{t,i}|]\right) + \mathbb{1}\left\{\max_{m \in [T]} |x_{m,i}| > 1\right\} \max_{m \in [T]} |x_{m,i}|\right]$$
$$\le 4\varepsilon n + \sum_{i \in [n]} \mathbb{E}_{\vec{S} \sim \mathcal{D}^{nT}}\left[\mathbb{1}\left\{\max_{m \in [T]} |x_{m,i}| \le 1\right\} \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S}_{-i})}[x_{t,i}] + \mathbb{1}\left\{\max_{m \in [T]} |x_{m,i}| > 1\right\} \max_{m \in [T]} |x_{m,i}|\right] \quad (6)$$
We next want to remove the first indicator function. This is useful as without it, the expectation of a fresh example from $\mathcal{D}$ is zero. To that end we add and subtract the expression $\mathbb{1}\{\max_{m \in [T]} |x_{m,i}| > 1\} \cdot \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S}_{-i})}[x_{t,i}]$ to get (after replacing again $\mathbb{E}_t$ with $\max_t$)
$$(6) \le 4\varepsilon n + \sum_{i \in [n]} \mathbb{E}_{\vec{S} \sim \mathcal{D}^{nT}}\left[\mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S}_{-i})}[x_{t,i}] + 2 \cdot \mathbb{1}\left\{\max_{m \in [T]} |x_{m,i}| > 1\right\} \max_{m \in [T]} |x_{m,i}|\right] \le 4\varepsilon n + 2\sum_{i \in [n]} \sum_{m \in [T]} \mathbb{E}_{\vec{S} \sim \mathcal{D}^{nT}}\left[\mathbb{1}\{|x_{m,i}| > 1\} \cdot |x_{m,i}|\right] \le 4\varepsilon n + 2nT\Delta.$$
Here the term $\mathbb{E}_{\vec{S}}\left[\mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S}_{-i})}[x_{t,i}]\right]$ vanishes since $x_{t,i}$ is independent of $\vec{S}_{-i}$ and has zero mean.

Theorem 3.5 (Simplified High Probability Bound). Let $\mathcal{D}$ be a distribution over a domain $X$ such that $\mathbb{E}_{x \sim \mathcal{D}}[x] = 0$. Let $\Delta \ge 0$ be such that $\mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{1}\{|x| > 1\} \cdot |x|\right] \le \Delta$. Fix $1 \ge \varepsilon \ge \sqrt{\frac{1}{n}\ln(2/\Delta)}$. We have that
$$\Pr_{S \sim \mathcal{D}^n}\left[|\bar{f}(S)| \ge \varepsilon n\right] < \frac{2\Delta}{\varepsilon}.$$
We present the proof idea of the theorem. Any informalities made hereafter are removed in Section A.
Proof sketch.
We only analyze the probability that $\bar{f}(S)$ is large; the analysis is symmetric for when $\bar{f}(S)$ is small. Assume towards contradiction that with probability at least $\frac{\Delta}{\varepsilon}$ we have that $\bar{f}(S) \ge \varepsilon n$. We now construct the following algorithm $\mathcal{B}$ that contradicts our expectation bound.

Algorithm 1 $\mathcal{B}$

Input: $T$ databases of size $n$ each: $\vec{S} = (S_1, \ldots, S_T)$, where $T \triangleq \lfloor \varepsilon/\Delta \rfloor$.

1. For $t \in [T]$, define $q(\vec{S}, t) = \bar{f}(S_t)$.
2. Sample $t^* \in [T]$ with probability proportional to $\exp\left(\frac{\varepsilon}{2}\, q(\vec{S}, t)\right)$.

Output: $t^*$.

The fact that algorithm $\mathcal{B}$ is $(\varepsilon, (\bar{f}, 1))$-differentially private follows from the standard analysis of the Exponential Mechanism of McSherry and Talwar [14]. The analysis appears in the full version of this proof (Section A) for completeness.

Now consider applying $\mathcal{B}$ on databases $\vec{S} = (S_1, \ldots, S_T)$ containing i.i.d. samples from $\mathcal{D}$. By our assumption on $\mathcal{D}$, for every $t$ we have that $\bar{f}(S_t) \ge \varepsilon n$ with probability at least $\frac{\Delta}{\varepsilon}$. By our choice of $T = \lfloor \varepsilon/\Delta \rfloor$, we therefore get
$$\Pr_{\vec{S} \sim \mathcal{D}^{nT}}\left[\max_{t \in [T]}\left\{\bar{f}(S_t)\right\} \ge \varepsilon n\right] \ge 1 - \left(1 - \frac{\Delta}{\varepsilon}\right)^T \ge \frac{1}{2}.$$
The probability is taken over the random choice of the examples in $\vec{S}$ according to $\mathcal{D}$. Had it been the case that the random variable $\max_{t \in [T]}\{\bar{f}(S_t)\}$ is non-negative, we could have used Markov's Inequality to get
$$\mathbb{E}_{\vec{S} \sim \mathcal{D}^{nT}}\left[\max_{t \in [T]}\left\{q(\vec{S}, t)\right\}\right] = \mathbb{E}_{\vec{S} \sim \mathcal{D}^{nT}}\left[\max_{t \in [T]}\left\{\bar{f}(S_t)\right\}\right] \ge \frac{1}{2}\varepsilon n. \quad (7)$$
Even though it is not the case that $\max_{t \in [T]}\{\bar{f}(S_t)\}$ is non-negative, we now proceed as if Equation (7) holds. As described in the full version of this proof (Section A), this technical issue has an easy fix. So, in expectation, $\max_{t \in [T]} q(\vec{S}, t)$ is large. In order to contradict the expectation bound of Theorem A.2, we need to show that this is also the case for the index $t^*$ that is sampled in Step 2. To that end, we now use the following technical claim, stating that the expected quality of a solution sampled as in Step 2 is high.

Claim 3.6 (e.g., [2]). Let $H$ be a finite set, $h : H \to \mathbb{R}$ a function, and $\eta > 0$. Define a random variable $Y$ on $H$ by $\Pr[Y = y] = \exp(\eta h(y))/C$, where $C = \sum_{y \in H} \exp(\eta h(y))$. Then $\mathbb{E}[h(Y)] \ge \max_{y \in H} h(y) - \frac{1}{\eta}\ln|H|$.

For every fixture of $\vec{S}$, we can apply Claim 3.6 with $h(t) = q(\vec{S}, t)$ and $\eta = \frac{\varepsilon}{2}$ to get
$$\mathbb{E}_{t^* \in_R [T]}\left[q(\vec{S}, t^*)\right] = \mathbb{E}_{t^* \in_R [T]}\left[\bar{f}(S_{t^*})\right] \ge \max_{t \in [T]}\left\{\bar{f}(S_t)\right\} - \frac{2}{\varepsilon}\ln(T).$$
Taking the expectation also over $\vec{S} \sim \mathcal{D}^{nT}$ we get that
$$\mathbb{E}_{\substack{\vec{S} \sim \mathcal{D}^{nT} \\ t^* \leftarrow \mathcal{B}(\vec{S})}}\left[\bar{f}(S_{t^*})\right] \ge \mathbb{E}_{\vec{S} \sim \mathcal{D}^{nT}}\left[\max_{t \in [T]}\left\{\bar{f}(S_t)\right\}\right] - \frac{2}{\varepsilon}\ln(T) \ge \frac{1}{2}\varepsilon n - \frac{2}{\varepsilon}\ln(T).$$
This contradicts Theorem A.2 whenever $\varepsilon > \sqrt{\frac{1}{n}\ln(T)} = \sqrt{\frac{1}{n}\ln(2\varepsilon/\Delta)}$.
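The following is a minimal Python sketch of algorithm $\mathcal{B}$ (NumPy assumed; the underlying zero-mean distribution and the parameters are illustrative): it scores each of the $T$ databases by its sample sum and samples an index with probability proportional to $\exp\left(\frac{\varepsilon}{2} q(\vec{S}, t)\right)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def algorithm_B(S_vec, eps):
    """Select an index t in [T] with probability proportional to
    exp((eps / 2) * q(S_vec, t)), where q(S_vec, t) is the sample sum of S_t."""
    q = S_vec.sum(axis=1)
    logits = 0.5 * eps * q
    logits -= logits.max()            # numerical stabilization
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(q), p=probs)

n, T, eps = 500, 50, 0.2
S_vec = rng.standard_normal((T, n))   # T databases of n i.i.d. samples, mean zero
t_star = algorithm_B(S_vec, eps)
print(t_star, S_vec[t_star].sum(), S_vec.sum(axis=1).max())
```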
4 Applications

In this section we demonstrate how Theorem 1.2 can be used in proving a variety of concentration bounds.

4.1 The Subgaussian Diameter

Recall that for a low-sensitivity function $f$, one could use McDiarmid's Inequality to obtain a high probability bound on the difference $|f(S) - f(\mathcal{D}^n)|$, and this bound is distribution-independent. That is, the bound does not depend on $\mathcal{D}$. Over the last few years, there has been some work on providing distribution-dependent refinements to McDiarmid's Inequality that hold even for functions with high worst-case sensitivity, but with low "average-case" sensitivity, where "average" is with respect to the underlying distribution $\mathcal{D}$. The following is one such refinement, by Kontorovich [13].

Definition 4.1 ([13]). Let $\mathcal{D}$ be a distribution over a domain $X$, and let $\rho : X^2 \to \mathbb{R}_{\ge 0}$. The symmetrized distance of $(X, \rho, \mathcal{D})$ is the random variable $\Xi = \xi \cdot \rho(x, x')$ where $x, x' \sim \mathcal{D}$ are independent and $\xi$ is uniform on $\{\pm 1\}$, independent of $x, x'$. The subgaussian diameter of $(X, \rho, \mathcal{D})$, denoted $\Delta_{SG}(X, \rho, \mathcal{D})$, is the smallest $\sigma \in \mathbb{R}_{\ge 0}$ such that
$$\mathbb{E}\left[e^{\lambda \Xi}\right] \le e^{\sigma^2 \lambda^2 / 2}, \quad \forall \lambda \in \mathbb{R}.$$
In [13], Kontorovich showed the following theorem:
Theorem 4.2 ([13], informal). Let $f : X^n \to \mathbb{R}$ be a function mapping databases of size $n$ over a domain $X$ to the reals. Assume that there exists a function $\rho : X^2 \to \mathbb{R}_{\ge 0}$ s.t. for every $i \in [n]$, every $S \in X^n$, and every $y, z \in X$ we have that
$$\left|f\left(S^{(i \leftarrow y)}\right) - f\left(S^{(i \leftarrow z)}\right)\right| \le \rho(y, z),$$
where $S^{(i \leftarrow x)}$ is the same as $S$ except that the $i$th element is replaced with $x$. Then,
$$\Pr_{S \sim \mathcal{D}^n}\left[|f(S) - f(\mathcal{D}^n)| \ge t\right] \le 2\exp\left(-\frac{t^2}{2n \cdot \Delta_{SG}^2(X, \rho, \mathcal{D})}\right).$$

Informally, using the above theorem it is possible to obtain concentration bounds for functions with unbounded sensitivity (in the worst case), provided that the sensitivity (as a random variable) is subgaussian. In this section we show that our result implies a similar version of this theorem. While the bound we obtain is weaker than Theorem 4.2, our techniques can be extended to obtain concentration bounds even in cases where the sensitivity is not subgaussian (that is, in cases where the subgaussian diameter is unbounded, and hence, Theorem 4.2 could not be applied).

Let us denote $\sigma = \Delta_{SG}(X, \rho, \mathcal{D})$. Now for $t \ge 0$,
$$\Pr_{x,y \sim \mathcal{D}}[\rho(x, y) \ge t] = 2\Pr_{\substack{x,y \sim \mathcal{D} \\ \xi \in \{\pm 1\}}}[\xi \cdot \rho(x, y) \ge t] = 2\Pr[\Xi \ge t] = 2\Pr\left[e^{\frac{t}{\sigma^2}\Xi} \ge e^{\frac{t^2}{\sigma^2}}\right] \le 2e^{-\frac{t^2}{\sigma^2}} \cdot \mathbb{E}\left[e^{\frac{t}{\sigma^2}\Xi}\right] \le 2e^{-\frac{t^2}{\sigma^2}} \cdot e^{\frac{t^2}{2\sigma^2}} = 2\exp\left(-\frac{t^2}{2\sigma^2}\right). \quad (8)$$
So,
$$\mathbb{E}_{\substack{S \sim \mathcal{D}^n \\ x' \sim \mathcal{D}}}\left[\mathbb{1}\left\{\left|f(S) - f\left(S^{(i \leftarrow x')}\right)\right| > \lambda\right\} \cdot \left|f(S) - f\left(S^{(i \leftarrow x')}\right)\right|\right] \le \mathbb{E}_{x,y \sim \mathcal{D}}\left[\mathbb{1}\{\rho(x, y) > \lambda\} \cdot \rho(x, y)\right]$$
$$= \int_0^{\lambda} \Pr_{x,y \sim \mathcal{D}}\left[\mathbb{1}\{\rho(x,y) > \lambda\} \cdot \rho(x, y) \ge t\right] \mathrm{d}t + \int_{\lambda}^{\infty} \Pr_{x,y \sim \mathcal{D}}\left[\mathbb{1}\{\rho(x,y) > \lambda\} \cdot \rho(x, y) \ge t\right] \mathrm{d}t$$
$$= \int_0^{\lambda} \Pr_{x,y \sim \mathcal{D}}[\rho(x, y) \ge \lambda]\,\mathrm{d}t + \int_{\lambda}^{\infty} \Pr_{x,y \sim \mathcal{D}}[\rho(x, y) \ge t]\,\mathrm{d}t = \lambda \cdot \Pr_{x,y \sim \mathcal{D}}[\rho(x, y) \ge \lambda] + \int_{\lambda}^{\infty} \Pr_{x,y \sim \mathcal{D}}[\rho(x, y) \ge t]\,\mathrm{d}t$$
$$\le 2\lambda \exp\left(-\frac{\lambda^2}{2\sigma^2}\right) + \int_{\lambda}^{\infty} 2\exp\left(-\frac{t^2}{2\sigma^2}\right)\mathrm{d}t = 2\lambda \exp\left(-\frac{\lambda^2}{2\sigma^2}\right) + \sqrt{2\pi}\,\sigma \cdot \mathrm{erfc}\left(\frac{\lambda}{\sqrt{2}\,\sigma}\right)$$
$$\le 2\lambda \exp\left(-\frac{\lambda^2}{2\sigma^2}\right) + \sqrt{2\pi}\,\sigma \cdot \exp\left(-\frac{\lambda^2}{2\sigma^2}\right) \le 2(\lambda + 2\sigma) \cdot \exp\left(-\frac{\lambda^2}{2\sigma^2}\right) \triangleq \Delta.$$
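The bound $\Delta$ derived above is easy to evaluate numerically. Below is an illustrative Python check (standard library plus NumPy) for the concrete choice $X = \mathbb{R}$, $\rho(x, y) = |x - y|$, $\mathcal{D} = N(0, 1)$, for which $\rho(x, y)$ is the absolute value of an $N(0, 2)$ variable and the tail bound (8) holds with $\sigma = \sqrt{2}$; this setting is an assumption made for the example only.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

sigma, lam, trials = math.sqrt(2.0), 3.0, 1_000_000

rho = np.abs(rng.standard_normal(trials) - rng.standard_normal(trials))
empirical = np.mean(np.where(rho > lam, rho, 0.0))   # E[ 1{rho > lam} * rho ]

# The boundary term and the tail integral from the derivation above.
bound = 2 * lam * math.exp(-lam**2 / (2 * sigma**2)) \
      + math.sqrt(2 * math.pi) * sigma * math.erfc(lam / (math.sqrt(2) * sigma))
print(f"empirical: {empirical:.5f}  Delta: {bound:.5f}")
```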
In order to apply Theorem 1.2 we need to ensure that $n \ge O\left(\frac{1}{\varepsilon \cdot \min\{1,\varepsilon\}}\ln\left(\frac{\lambda \cdot \min\{1,\varepsilon\}}{\Delta}\right)\right)$. For our choice of $\Delta$, it suffices to set $\varepsilon_1 = \Theta\left(\frac{\lambda}{\sqrt{n}\,\sigma}\right)$, assuming that $\frac{\lambda}{\sqrt{n}\,\sigma} \le 1$. Otherwise, if $\frac{\lambda}{\sqrt{n}\,\sigma} > 1$, we will choose $\varepsilon_2 = \Theta\left(\frac{\lambda^2}{n\sigma^2}\right)$. Plugging $(\varepsilon_1, \Delta)$ or $(\varepsilon_2, \Delta)$ into Theorem 1.2, and simplifying, we get
$$\Pr_{S \sim \mathcal{D}^n}\left[|f(S) - f(\mathcal{D}^n)| \ge t\right] \le \begin{cases} e^{-\Omega\left(\frac{t}{\sqrt{n}\,\sigma}\right)}, & t \le \sigma \cdot n^{3/2}, \\[4pt] e^{-\Omega\left(\frac{t^{2/3}}{\sigma^{2/3}}\right)}, & t > \sigma \cdot n^{3/2}. \end{cases} \quad (9)$$
Clearly, the bound of Theorem 4.2 is stronger. Note, however, that the only assumption we used here is that $\int_{\lambda}^{\infty} \Pr_{x,y \sim \mathcal{D}}[\rho(x, y) \ge t]\,\mathrm{d}t$ is small. Hence, as the following section shows, this argument can be extended to obtain concentration bounds even when $\Delta_{SG}(X, \rho, \mathcal{D})$ is unbounded. We remark that Inequality (9) can be slightly improved by using part 2 of Theorem 1.2. This will be illustrated in the following section.

4.2 Heavy-Tailed Sensitivity

Let $f : X^n \to \mathbb{R}$ be a function mapping databases of size $n$ over a domain $X$ to the reals. Assume that there exists a function $\rho : X^2 \to \mathbb{R}_{\ge 0}$ s.t. for every $i \in [n]$, every $S \in X^n$, and every $y, z \in X$ we have that
$$\left|f\left(S^{(i \leftarrow y)}\right) - f\left(S^{(i \leftarrow z)}\right)\right| \le \rho(y, z),$$
where $S^{(i \leftarrow x)}$ is the same as $S$ except that the $i$th element is replaced with $x$.

As stated in the previous section, the results of [13] can be used to obtain a high probability bound on $|f(S) - f(\mathcal{D}^n)|$ whenever $\Pr_{x,y \sim \mathcal{D}}[\rho(x, y) \ge t] \le \exp\left(-t^2/\sigma^2\right)$ for some $\sigma > 0$. In contrast, our bound can be used whenever $\int_{\lambda}^{\infty} \Pr_{x,y \sim \mathcal{D}}[\rho(x, y) \ge t]\,\mathrm{d}t$ is finite. In particular, we now use it to obtain a concentration bound for a case where the probability distribution of $\rho(x, y)$ is heavy-tailed, and in fact, has infinite variance. Specifically, assume that all we know on $\rho(x, y)$ is that $\Pr[\rho(x, y) \ge t] \le 1/t^2$ for every $t \ge 1$ (as in a Pareto distribution with infinite variance). Let $\lambda \ge 1$.
We calculate:
$$\mathbb{E}_{\substack{S \sim \mathcal{D}^n \\ x' \sim \mathcal{D}}}\left[\mathbb{1}\left\{\left|f(S) - f\left(S^{(i \leftarrow x')}\right)\right| > \lambda\right\} \cdot \left|f(S) - f\left(S^{(i \leftarrow x')}\right)\right|\right] \le \mathbb{E}_{x,y \sim \mathcal{D}}\left[\mathbb{1}\{\rho(x, y) > \lambda\} \cdot \rho(x, y)\right]$$
$$= \int_0^{\lambda} \Pr_{x,y \sim \mathcal{D}}\left[\mathbb{1}\{\rho(x,y) > \lambda\} \cdot \rho(x, y) \ge t\right]\mathrm{d}t + \int_{\lambda}^{\infty} \Pr_{x,y \sim \mathcal{D}}\left[\mathbb{1}\{\rho(x,y) > \lambda\} \cdot \rho(x, y) \ge t\right]\mathrm{d}t$$
$$= \lambda \cdot \Pr_{x,y \sim \mathcal{D}}[\rho(x, y) \ge \lambda] + \int_{\lambda}^{\infty} \Pr_{x,y \sim \mathcal{D}}[\rho(x, y) \ge t]\,\mathrm{d}t \le \lambda \cdot \frac{1}{\lambda^2} + \int_{\lambda}^{\infty} \frac{1}{t^2}\,\mathrm{d}t = \frac{2}{\lambda} \triangleq \Delta.$$
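The truncated tail expectation computed above can also be checked numerically. The sketch below (Python/NumPy; the survival function $\Pr[\rho \ge t] = 1/t^2$ is the assumed worst case) evaluates $\lambda \cdot \Pr[\rho \ge \lambda] + \int_{\lambda}^{\infty} \Pr[\rho \ge t]\,\mathrm{d}t$ by the trapezoid rule and compares it with the closed form $2/\lambda$.

```python
import numpy as np

def delta_bound(survival, lam, upper=1e8, steps=200_000):
    """Evaluate lam * Pr[rho >= lam] + integral_lam^upper Pr[rho >= t] dt
    with the trapezoid rule on a geometric grid."""
    ts = np.geomspace(lam, upper, steps)
    vals = survival(ts)
    integral = (0.5 * (vals[1:] + vals[:-1]) * np.diff(ts)).sum()
    return lam * survival(lam) + integral

survival = lambda t: 1.0 / t**2      # assumed worst-case tail: Pr[rho >= t] = 1/t^2
for lam in (2.0, 5.0, 10.0):
    print(f"lambda={lam:5.1f}  numeric={delta_bound(survival, lam):.5f}  2/lambda={2.0/lam:.5f}")
```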
In order to apply Theorem 1.2 we need to ensure that $n \ge O\left(\frac{1}{\varepsilon \cdot \min\{1,\varepsilon\}}\ln\left(\frac{\lambda \cdot \min\{1,\varepsilon\}}{\Delta} + 1\right)\right)$. Assuming that $n \ge \ln(\lambda)$, with our choice of $\Delta$ it suffices to set $\varepsilon_1 = \Theta\left(\sqrt{\frac{1}{n}\ln(\lambda)}\right)$. Plugging $\varepsilon_1$ and $\Delta$ into Theorem 1.2, and simplifying, we get
$$\Pr_{S \sim \mathcal{D}^n}\left[|f(S) - f(\mathcal{D}^n)| \ge t\right] \le \tilde{O}\left(\frac{n^{3/2}}{t^2}\right). \quad (10)$$
Observe that the above bound decays as $1/t^2$. This should be contrasted with Markov's Inequality, which would decay as $1/t$. Recall the assumption that the variance of $\rho(x, y)$ is unbounded. Hence, the variance of $f(S)$ can also be unbounded, and Chebyshev's inequality could not be applied.

As we now explain, Inequality (10) can be improved using part 2 of Theorem 1.2. To that end, for a fixed database $S \in X^n$, we calculate:
$$\mathbb{E}_{y,z \sim \mathcal{D}}\left[\mathbb{1}\left\{\left|f(S^{(i \leftarrow y)}) - f\left(S^{(i \leftarrow z)}\right)\right| \le \lambda\right\} \cdot \left|f(S^{(i \leftarrow y)}) - f\left(S^{(i \leftarrow z)}\right)\right|\right] \le \mathbb{E}_{y,z \sim \mathcal{D}}[\rho(y, z)] \le \int_0^1 1\,\mathrm{d}t + \int_1^{\infty} \frac{1}{t^2}\,\mathrm{d}t = 2 \triangleq \tau.$$
In order to apply part 2 of Theorem 1.2 we need to ensure that $n \ge O\left(\frac{\lambda}{\varepsilon \cdot \min\{1,\varepsilon\} \cdot \tau}\ln\left(\frac{\varepsilon\tau}{\Delta}\right)\right)$. For our choice of $\Delta$ and $\tau$, if $n \ge \lambda\ln(\lambda)$ then it suffices to set $\varepsilon_1 = \Theta\left(\sqrt{\frac{\lambda}{n}\ln(\lambda)}\right)$. Otherwise, if $n < \lambda\ln(\lambda)$ then it suffices to set $\varepsilon_2 = \Theta\left(\frac{\lambda}{n}\ln(\lambda)\right)$. Plugging $(\varepsilon_1, \Delta)$ or $(\varepsilon_2, \Delta)$ into Theorem 1.2, and simplifying, we get
$$\Pr_{S \sim \mathcal{D}^n}\left[|f(S) - f(\mathcal{D}^n)| \ge t\right] \le \begin{cases} \tilde{O}\left(\frac{n^2}{t^3}\right), & t \le n, \\[4pt] \tilde{O}\left(\frac{n}{t^2}\right), & t > n. \end{cases}$$

4.3 Counting Triangles in a Random Graph

A random graph $G(N, p)$ on $N$ vertices $1, 2, \ldots, N$ is defined by drawing an edge between each pair $1 \le i < j \le N$ independently with probability $p$. There are $n = \binom{N}{2}$ i.i.d. random variables $x_{\{i,j\}}$ representing the choices: $x_{\{i,j\}} = x_{\{j,i\}} = 1$ if the edge $\{i, j\}$ is drawn, and 0 otherwise. We will use $\mathcal{D}$ to denote the distribution with $\Pr_{x \sim \mathcal{D}}[x = 1] = p$ and $\Pr_{x \sim \mathcal{D}}[x = 0] = 1 - p$, and let $S = \left(x_{\{1,2\}}, \ldots, x_{\{N-1,N\}}\right) \sim \mathcal{D}^n$.

We say that three vertices $i, j, \ell$ form a triangle if there is an edge between every pair of them. Denote by $f_K(S)$ the number of triangles in the graph defined by $S$. For a small constant $\alpha$, we would like to have an exponential bound on the following probability
$$\Pr\left[f_K(S) \ge (1 + \alpha) \cdot f_K(\mathcal{D}^n)\right].$$
Specifically, we are interested in small values of $p = o(1)$, for which $f_K(\mathcal{D}^n) = \binom{N}{3} p^3 = \Theta\left(N^3 p^3\right) = o(N^3)$. The difficulty with this choice of $p$ is that (in the worst case) adding a single edge to the graph can increase the number of triangles by $(N - 2)$.

Theorem 4.3 ([11], informal). Let $\alpha$ be a small constant. It holds that
$$\exp\left(-\Theta\left(p^2 N^2 \log(1/p)\right)\right) \le \Pr_{S \sim \mathcal{D}^n}\left[f_K(S) \ge (1 + \alpha) \cdot f_K(\mathcal{D}^n)\right] \le \exp\left(-\Theta\left(p^2 N^2\right)\right).$$

In this section we show that our result can be used to analyze this problem. While the bound we obtain is much weaker than Theorem 4.3, we find it interesting that the same technique from the last sections can also be applied here. To make things more concrete, we fix $p = N^{-1/4}$.

In order to use our concentration bound, we start by analyzing the expected difference incurred to $f_K$ by resampling a single edge.
We will denote by $\triangle_{i,j}(S)$ the number of triangles that are created (or deleted) by adding (or removing) the edge $\{i, j\}$. That is,
$$\triangle_{i,j}(S) = \left|\left\{\ell \neq i, j \;:\; x_{\{i,\ell\}} = 1 \text{ and } x_{\{\ell,j\}} = 1\right\}\right|.$$
Observe that $\triangle_{i,j}(S)$ does not depend on $x_{\{i,j\}}$. Moreover, observe that for every fixture of $i < j$ we have that $\triangle_{i,j}(S)$ is the sum of $(N - 2)$ i.i.d. indicators, each equal to 1 with probability $p^2$.
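For intuition, here is a minimal Python sketch (NumPy assumed; $N$ and $p$ are illustrative) that samples $G(N, p)$ and computes $\triangle_{i,j}(S)$ as the number of common neighbors of $i$ and $j$, whose mean is $(N-2)p^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_graph(N, p):
    """Adjacency matrix of G(N, p): each edge {i, j} drawn independently w.p. p."""
    upper = np.triu(rng.random((N, N)) < p, k=1)
    return (upper | upper.T).astype(np.int64)

def triangles_through_edge(adj, i, j):
    """The quantity denoted triangle_{i,j}(S): the number of common neighbors
    of i and j, i.e. triangles created (or deleted) by flipping the edge {i, j}."""
    common = adj[i] & adj[j]
    common[i] = common[j] = 0        # a vertex is not its own neighbor
    return int(common.sum())

N, p = 400, 0.1
adj = sample_graph(N, p)
print(triangles_through_edge(adj, 0, 1), (N - 2) * p**2)  # one draw vs. its mean
```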
Fix $S = \left(x_{\{1,2\}}, \ldots, x_{\{N-1,N\}}\right) \in \{0,1\}^n$ and $x' \in \{0,1\}$. We have that
$$\left|f_K(S) - f_K\left(S^{(\{i,j\} \leftarrow x')}\right)\right| = \begin{cases} 0, & x_{\{i,j\}} = x' \\ \triangle_{i,j}(S), & x_{\{i,j\}} \neq x' \end{cases}$$
where $S^{(\{i,j\} \leftarrow x')}$ is the same as $S$ except with $x_{\{i,j\}}$ replaced with $x'$. Fix $i < j$. We can now calculate
$$\mathbb{E}_{\substack{S \sim \mathcal{D}^n \\ x' \sim \mathcal{D}}}\left[\mathbb{1}\left\{\left|f_K(S) - f_K\left(S^{(\{i,j\} \leftarrow x')}\right)\right| > \lambda\right\} \cdot \left|f_K(S) - f_K\left(S^{(\{i,j\} \leftarrow x')}\right)\right|\right] = \mathbb{E}_{\substack{S \sim \mathcal{D}^n \\ x' \sim \mathcal{D}}}\left[\mathbb{1}\left\{x_{\{i,j\}} \neq x'\right\} \cdot \mathbb{1}\left\{\triangle_{i,j}(S) > \lambda\right\} \cdot \triangle_{i,j}(S)\right]$$
$$= \Pr_{x_{\{i,j\}},\, x' \sim \mathcal{D}}\left[x_{\{i,j\}} \neq x'\right] \cdot \mathbb{E}_{S \sim \mathcal{D}^n}\left[\mathbb{1}\left\{\triangle_{i,j}(S) > \lambda\right\} \cdot \triangle_{i,j}(S)\right]$$
$$= 2p(1-p) \cdot \left(\lambda \cdot \Pr_{S \sim \mathcal{D}^n}\left[\triangle_{i,j}(S) \ge \lambda\right] + \int_{\lambda}^{N} \Pr_{S \sim \mathcal{D}^n}\left[\triangle_{i,j}(S) \ge t\right]\mathrm{d}t\right) \le 2pN \cdot \Pr_{S \sim \mathcal{D}^n}\left[\triangle_{i,j}(S) \ge \lambda\right]. \quad (11)$$
Recall that $\triangle_{i,j}(S)$ is the sum of $(N - 2)$ i.i.d. indicators, each equal to 1 with probability $p^2$. We can upper bound the probability that $\triangle_{i,j}(S) \ge \lambda$ with the probability that a sum of $N$ such random variables is at least $\lambda$. We will use the following variant of the Chernoff bound, known as the Chernoff–Hoeffding theorem:

Theorem 4.4 ([10]). Let $X_1, \ldots, X_n$ be independent random variables where $\Pr[X_i = 1] = p$ and $\Pr[X_i = 0] = 1 - p$ for some $0 < p < 1$. Let $k$ be s.t. $p < \frac{k}{n} < 1$. Then,
$$\Pr\left[\sum_{i=1}^n X_i \ge k\right] \le \exp\left(-n \cdot D\left(\frac{k}{n} \,\Big\|\, p\right)\right),$$
where $D(a \| p)$ is the relative entropy between an $a$-coin and a $p$-coin (i.e., between the Bernoulli($a$) and Bernoulli($p$) distributions):
$$D(a \| p) = a \cdot \log\left(\frac{a}{p}\right) + (1 - a) \cdot \log\left(\frac{1 - a}{1 - p}\right).$$
Fix constants c > b > . For N ≥ max { /b , / ( c − b ) } we have that D (cid:16) N − b (cid:13)(cid:13)(cid:13) N − c (cid:17) ≥ c − b · N − b · log( N ) . Using Claim 4.5, for large enough N , we have that(13) ≤ pN · exp (cid:16) − N / (cid:17) . (14)So, denoting ∆ = 2 pN · exp (cid:16) − N / (cid:17) , we get that E S ∼D n x (cid:48) ∼D (cid:20) (cid:26)(cid:12)(cid:12)(cid:12)(cid:12) f K ( S ) − f K (cid:16) S ( { i,j }← x (cid:48) ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) > λ (cid:27) · (cid:12)(cid:12)(cid:12)(cid:12) f K ( S ) − f K (cid:16) S ( { i,j }← x (cid:48) ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:21) ≤ ∆ . In order to obtain a meaningful bound, we will need to use part 2 of Theorem 1.2. To that end,for every fixture of S ∈ X n and i < j we can compute E y,z ∼D (cid:20) (cid:26)(cid:12)(cid:12)(cid:12)(cid:12) f K ( S ( { i,j }← y ) ) − f K (cid:16) S ( { i,j }← z ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) ≤ λ (cid:27) · (cid:12)(cid:12)(cid:12)(cid:12) f K ( S ( { i,j }← y ) ) − f K (cid:16) S ( { i,j }← z ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:21) ≤ E y,z ∼D [ { y (cid:44) z } · λ ]= 2 p (1 − p ) λ ≤ pλ (cid:44) τ. Finally, in order to apply Theorem 1.2, we need to ensure that n ≥ O (cid:16) λε min { ,ε } τ ln (cid:16) min { ,ε } τ ∆ (cid:17)(cid:17) .With our choices for ∆ and τ , it su ffi ces to set ε = Θ (cid:18)(cid:113) λnp (cid:19) . Plugging ε , ∆ and τ into Theorem 1.2,and simplifying, we get thatPr S ∼D n (cid:104) | f K ( S ) − f K ( D n ) | ≥ o (cid:16) f K ( D n ) (cid:17)(cid:105) < exp (cid:16) − N / (cid:17) . It remains to prove Claim 4.5:
Claim 4.5. Fix constants $c > b > 0$. For $N \ge \max\left\{2^{2/b}, 2^{8/(c-b)}\right\}$ we have that $D\left(N^{-b} \,\big\|\, N^{-c}\right) \ge \frac{c-b}{2} \cdot N^{-b} \cdot \log(N)$.

Proof of Claim 4.5.
$$D\left(N^{-b} \,\big\|\, N^{-c}\right) = N^{-b} \cdot \log\left(N^{c-b}\right) + \left(1 - N^{-b}\right) \cdot \log\left(\frac{1 - N^{-b}}{1 - N^{-c}}\right)$$
$$= N^{-b} \cdot \log\left(N^{c-b}\right) + \left(1 - N^{-b}\right) \cdot \log\left(\frac{N^c - N^{c-b}}{N^c - 1}\right) = N^{-b} \cdot \log\left(N^{c-b}\right) + \left(1 - N^{-b}\right) \cdot \log\left(1 - \frac{N^{c-b} - 1}{N^c - 1}\right) \quad (15)$$
Using the fact that $\log(1 - x) \ge -2x$ for every $0 \le x \le 1/2$, and assuming that $N \ge 2^{2/b}$ (so that $\frac{N^{c-b}-1}{N^c-1} \le 2N^{-b} \le \frac{1}{2}$), we have that
$$(15) \ge N^{-b} \cdot \log\left(N^{c-b}\right) - 2\left(1 - N^{-b}\right) \cdot \frac{N^{c-b} - 1}{N^c - 1} \ge N^{-b} \cdot \log\left(N^{c-b}\right) - 2 \cdot \frac{N^{c-b} - 1}{N^c - 1} \ge N^{-b} \cdot \log\left(N^{c-b}\right) - 4N^{-b}. \quad (16)$$
Assuming that $N \ge 2^{8/(c-b)}$ we get
$$(16) \ge \frac{1}{2} \cdot N^{-b} \cdot \log\left(N^{c-b}\right) \ge \frac{c - b}{2} \cdot N^{-b} \cdot \log(N).$$

5 High-Sensitivity Functions Under Adaptive Analysis

Let $S$ be a sample of $n$ i.i.d. elements from some distribution $\mathcal{D}$. Recall that if a low-sensitivity function $f$ is identified by a differentially private algorithm operating on $S$, then w.h.p. $f(S) \approx f(\mathcal{D}^n) \triangleq \mathbb{E}_{S' \sim \mathcal{D}^n}[f(S')]$. In this section we present a simple example showing that, in general, this is not the case for high-sensitivity functions. Specifically, we show that a differentially private algorithm operating on $S$ can identify a high-sensitivity function $f$ s.t. $|f(S) - f(\mathcal{D}^n)|$ is arbitrarily large, even though $|f(S') - f(\mathcal{D}^n)|$ is small for a fresh sample $S' \sim \mathcal{D}^n$.

Theorem 5.1.
Fix $\beta, \varepsilon, B > 0$, let $\mathcal{U}$ be the uniform distribution over $X = \{\pm 1\}^d$ where $d = \mathrm{poly}(1/\beta)$, and let $n \ge O\left(\frac{1}{\varepsilon^2}\ln(1/\beta)\right)$. There exists an $(\varepsilon, 0)$-differentially private algorithm $\mathcal{A}$ that operates on a database $S \in (\{\pm 1\}^d)^n$ and returns a function mapping $(\{\pm 1\}^d)^n$ to $\mathbb{R}$, s.t. the following hold.

1. For every $f$ in the range of $\mathcal{A}$ it holds that $\Pr_{S' \sim \mathcal{U}^n}[f(S') \neq f(\mathcal{U}^n)] \le \beta$.
2. $\Pr_{S \sim \mathcal{U}^n,\, f \leftarrow \mathcal{A}(S)}\left[|f(S) - f(\mathcal{U}^n)| \ge B\right] \ge 1/2$.

Proof. For $t \in [d]$, define $f_t : (\{\pm 1\}^d)^n \to \mathbb{R}$ as
$$f_t(x_1, \ldots, x_n) = \begin{cases} 0, & \left|\sum_{i \in [n]} x_{i,t}\right| \le \sqrt{2n\ln(2/\beta)} \\ B, & \sum_{i \in [n]} x_{i,t} > \sqrt{2n\ln(2/\beta)} \\ -B, & \sum_{i \in [n]} x_{i,t} < -\sqrt{2n\ln(2/\beta)} \end{cases}$$
That is, given a database $S$ of $n$ rows from $\{\pm 1\}^d$, we define $f_t(S)$ as 0 if the sum of column $t$ (in absolute value) is less than some threshold, and otherwise set $f_t(S)$ to be $\pm B$ (depending on the sign of the sum). Observe that the global sensitivity of $f_t$ is $B$, and that $f_t(\mathcal{U}^n) \triangleq \mathbb{E}_{S' \sim \mathcal{U}^n}[f_t(S')] = 0$. Also, by the Hoeffding bound, we have that
$$\Pr_{S \sim \mathcal{U}^n}[f_t(S) \neq 0] \le \beta.$$
So, for every fixed $t$, with high probability over sampling $S \sim \mathcal{U}^n$ we have that $f_t(S) = 0 = f_t(\mathcal{U}^n)$. Nevertheless, as we now explain, if $d$ is large enough, then an $(\varepsilon, 0)$-differentially private algorithm can easily identify a "bad" index $t^*$ such that $|f_{t^*}(S)| = B$.

Consider the algorithm that on input $S = (x_1, x_2, \ldots, x_n)$ samples an index $t \in [d]$ with probability proportional to $\exp\left(\frac{\varepsilon}{4}\left|\sum_{i \in [n]} x_{i,t}\right|\right)$. We will call it algorithm BadIndex.

By the properties of the exponential mechanism, algorithm BadIndex is $(\varepsilon, 0)$-differentially private. Moreover, with probability at least $3/4$, the output $t^*$ satisfies
$$\left|\sum_{i \in [n]} x_{i,t^*}\right| \ge \max_{t \in [d]}\left|\sum_{i \in [n]} x_{i,t}\right| - \frac{4}{\varepsilon}\ln(4d). \quad (17)$$
In addition, by Theorem 2.6 (tightness of Chernoff bound), for every fixed $t$ it holds that
$$\Pr\left[\sum_{i \in [n]} x_{i,t} \ge 1.5 \cdot \sqrt{2n\ln(2/\beta)}\right] \ge \left(\frac{\beta}{2}\right)^{O(1)}.$$
As the columns are independent, taking $d = \left(\frac{2}{\beta}\right)^{O(1)}$ (i.e., $d = \mathrm{poly}(1/\beta)$), we get that
$$\Pr\left[\max_{t \in [d]} \sum_{i \in [n]} x_{i,t} \ge 1.5 \cdot \sqrt{2n\ln(2/\beta)}\right] \ge 3/4. \quad (18)$$
Combining (17) and (18) we get that with probability at least $1/2$, algorithm BadIndex identifies an index $t^*$ such that
$$\left|\sum_{i \in [n]} x_{i,t^*}\right| \ge 1.5 \cdot \sqrt{2n\ln(2/\beta)} - \frac{4}{\varepsilon}\ln(4d).$$
Assuming that $n \ge O\left(\frac{1}{\varepsilon^2}\ln(1/\beta)\right)$ we get that with probability at least $1/2$, algorithm BadIndex outputs an index $t^*$ s.t. $|f_{t^*}(S)| = B$.
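A minimal Python sketch of algorithm BadIndex (NumPy assumed; $n$, $d$, and $\varepsilon$ are illustrative): it samples a column index with probability proportional to $\exp\left(\frac{\varepsilon}{4}\left|\sum_i x_{i,t}\right|\right)$, matching the sensitivity-2 quality function used above.

```python
import numpy as np

rng = np.random.default_rng(0)

def bad_index(S, eps):
    """Sample a column index t with probability proportional to
    exp((eps / 4) * |sum of column t|); the quality has sensitivity 2."""
    q = np.abs(S.sum(axis=0))
    logits = 0.25 * eps * q
    logits -= logits.max()           # numerical stabilization
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(S.shape[1], p=probs)

n, d, eps = 200, 4096, 1.0
S = rng.choice([-1, 1], size=(n, d))   # n i.i.d. uniform rows from {-1, +1}^d
t_star = bad_index(S, eps)
print(t_star, abs(S[:, t_star].sum()), np.abs(S.sum(axis=0)).max())
```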
5.1 Algorithm BadIndex Has High Max-Information

In this section we show that algorithm BadIndex has relatively high max-information. Given two (correlated) random variables $Y, Z$, we use $Y \otimes Z$ to denote the random variable obtained by drawing independent copies of $Y$ and $Z$ from their respective marginal distributions.

Definition 5.2 (Max-Information [5]). Let $Y$ and $Z$ be jointly distributed random variables over the domain $(\mathcal{Y}, \mathcal{Z})$. The $\beta$-approximate max-information between $Y$ and $Z$ is defined as
$$I_{\infty}^{\beta}(Y; Z) = \log \sup_{\substack{\mathcal{O} \subseteq (\mathcal{Y} \times \mathcal{Z}),\\ \Pr[(Y,Z) \in \mathcal{O}] > \beta}} \frac{\Pr[(Y, Z) \in \mathcal{O}] - \beta}{\Pr[Y \otimes Z \in \mathcal{O}]}.$$
An algorithm $\mathcal{A} : X^n \to F$ has $\beta$-approximate max-information of $k$ over product distributions, written $I_{\infty,P}^{\beta}(\mathcal{A}, n) \le k$, if for every distribution $\mathcal{D}$ over $X$, we have $I_{\infty}^{\beta}(S; \mathcal{A}(S)) \le k$ when $S \sim \mathcal{D}^n$.

It follows immediately from the definition that approximate max-information controls the probability of "bad events" that can happen as a result of the dependence of $\mathcal{A}(S)$ on $S$: for every event $\mathcal{O}$, we have $\Pr[(S, \mathcal{A}(S)) \in \mathcal{O}] \le e^k \Pr[S \otimes \mathcal{A}(S) \in \mathcal{O}] + \beta$.

Consider again algorithm BadIndex $: (\{\pm 1\}^d)^n \to F$ that operates on a database $S$ of size $n = O\left(\frac{1}{\varepsilon^2}\ln(1/\beta)\right)$ and identifies, with probability 1/2, a function $f$ s.t. $f(S) \neq 0$, even though $f(S') = 0$ w.p. $1 - \beta$ for a fresh sample $S'$.
Let us define $\mathcal{O}$ as the set of all pairs $(S, f)$, where $S$ is a database and $f$ is a function in the range of algorithm BadIndex such that $f(S) \neq 0$. That is,
$$\mathcal{O} = \left\{(S, f) \in (\{\pm 1\}^d)^n \times F \;:\; f(S) \neq 0\right\}.$$
If we assume that $I_{\infty,P}^{1/4}(\text{BadIndex}, n) \le k$, then by Definition 5.2 we have:
$$\frac{1}{2} \le \Pr_{\substack{S \sim \mathcal{U}^n \\ f \leftarrow \text{BadIndex}(S)}}[(S, f) \in \mathcal{O}] \le e^k \cdot \Pr_{\substack{S \sim \mathcal{U}^n,\, T \sim \mathcal{U}^n \\ f \leftarrow \text{BadIndex}(T)}}[(S, f) \in \mathcal{O}] + \frac{1}{4} \le e^k \cdot \beta + \frac{1}{4}.$$
So $k \ge \ln\left(\frac{1}{4\beta}\right) = \Omega(\varepsilon^2 n)$.

References

[1] Raef Bassily and Yoav Freund. Typicality-based stability and privacy. CoRR, abs/1604.03336, 2016.
[2] Raef Bassily, Kobbi Nissim, Adam D. Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 1046–1059, 2016.

[3] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statist., 23:493–507, 1952.

[4] Rachel Cummings, Katrina Ligett, Kobbi Nissim, Aaron Roth, and Zhiwei Steven Wu. Adaptive learning with robust generalization guarantees. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pages 772–814, 2016.

[5] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information Processing Systems (NIPS), Montreal, December 2015.

[6] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. In ACM Symposium on the Theory of Computing (STOC). ACM, June 2015.

[7] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Serge Vaudenay, editor, EUROCRYPT, volume 4004 of Lecture Notes in Computer Science, pages 486–503. Springer, 2006.

[8] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, volume 3876 of Lecture Notes in Computer Science, pages 265–284. Springer, 2006.

[9] Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data analysis is hard. In FOCS, pages 454–463, 2014.

[10] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[11] J. H. Kim and V. H. Vu. Divide and conquer martingales and the number of triangles in a random graph. Random Structures and Algorithms, 24(2):166–174, 2004.

[12] Philip N. Klein and Neal E. Young. On the number of iterations for Dantzig-Wolfe optimization and packing-covering approximation algorithms. SIAM J. Comput., 44(4):1154–1172, 2015.

[13] Aryeh Kontorovich. Concentration in unbounded metric spaces and algorithmic stability. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 28–36, 2014.

[14] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103. IEEE, Oct 20–23 2007.

[15] Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of preventing false discovery. In COLT, pages 1588–1628, 2015.

[16] Thomas Steinke and Jonathan Ullman. Subgaussian tail bounds via stability arguments. arXiv:1701.03493 [cs.DM], 2017.

[17] Van H. Vu. Concentration of non-Lipschitz functions and applications. Random Structures and Algorithms, 20(3):262–316, 2002.
A Concentration Bounds Through Differential Privacy – Missing Details
Claim 3.3. Fix a function $f : X^n \to \mathbb{R}$ and parameters $\varepsilon, \lambda \ge 0$. If $M : (X^n)^T \to Y$ is $(\varepsilon, (f,\lambda))$-differentially private then for every $(f,\lambda)$-neighboring databases $\vec{S}, \vec{S}' \in (X^n)^T$ and every function $h : Y \to \mathbb{R}$ we have that
$$\mathbb{E}_{y \leftarrow M(\vec{S})}[h(y)] \le e^{-\varepsilon} \cdot \mathbb{E}_{y \leftarrow M(\vec{S}')}[h(y)] + (e^{\varepsilon} - e^{-\varepsilon}) \cdot \mathbb{E}_{y \leftarrow M(\vec{S}')}[|h(y)|].$$

Proof.
$$\mathbb{E}_{y \leftarrow M(\vec{S})}[h(y)] = \int_0^{\infty} \Pr_{y \leftarrow M(\vec{S})}[h(y) \ge z]\,\mathrm{d}z - \int_{-\infty}^0 \Pr_{y \leftarrow M(\vec{S})}[h(y) \le z]\,\mathrm{d}z$$
$$\le e^{\varepsilon} \cdot \int_0^{\infty} \Pr_{y \leftarrow M(\vec{S}')}[h(y) \ge z]\,\mathrm{d}z - e^{-\varepsilon} \cdot \int_{-\infty}^0 \Pr_{y \leftarrow M(\vec{S}')}[h(y) \le z]\,\mathrm{d}z$$
$$= e^{-\varepsilon}\left(\int_0^{\infty} \Pr_{y \leftarrow M(\vec{S}')}[h(y) \ge z]\,\mathrm{d}z - \int_{-\infty}^0 \Pr_{y \leftarrow M(\vec{S}')}[h(y) \le z]\,\mathrm{d}z\right) + (e^{\varepsilon} - e^{-\varepsilon}) \cdot \int_0^{\infty} \Pr_{y \leftarrow M(\vec{S}')}[h(y) \ge z]\,\mathrm{d}z$$
$$= e^{-\varepsilon} \cdot \mathbb{E}_{y \leftarrow M(\vec{S}')}[h(y)] + (e^{\varepsilon} - e^{-\varepsilon}) \cdot \int_0^{\infty} \Pr_{y \leftarrow M(\vec{S}')}[h(y) \ge z]\,\mathrm{d}z$$
$$\le e^{-\varepsilon} \cdot \mathbb{E}_{y \leftarrow M(\vec{S}')}[h(y)] + (e^{\varepsilon} - e^{-\varepsilon}) \cdot \int_0^{\infty} \Pr_{y \leftarrow M(\vec{S}')}[|h(y)| \ge z]\,\mathrm{d}z = e^{-\varepsilon} \cdot \mathbb{E}_{y \leftarrow M(\vec{S}')}[h(y)] + (e^{\varepsilon} - e^{-\varepsilon}) \cdot \mathbb{E}_{y \leftarrow M(\vec{S}')}[|h(y)|].$$

A.1 Multi Sample Expectation Bound
Lemma A.1 (Expectation Bound). Let $\mathcal{D}$ be a distribution over a domain $X$, let $f : X^n \to \mathbb{R}$, and let $\Delta, \lambda \ge 0$ be s.t. for every $1 \le i \le n$ it holds that
$$\mathbb{E}_{\substack{S \sim \mathcal{D}^n \\ z \sim \mathcal{D}}}\left[\mathbb{1}\left\{\left|f(S) - f\left(S^{(i \leftarrow z)}\right)\right| > \lambda\right\} \cdot \left|f(S) - f\left(S^{(i \leftarrow z)}\right)\right|\right] \le \Delta, \quad (19)$$
where $S^{(i \leftarrow z)}$ is the same as $S$ except that the $i$th element is replaced with $z$. Let $\mathcal{A} : (X^n)^T \to ([T] \cup \{\bot\})$ be an $(\varepsilon, (f,\lambda))$-differentially private algorithm that operates on $T$ databases of size $n$ from $X$, and outputs an index $1 \le t \le T$ or $\bot$. Then
$$\left|\mathbb{E}_{\substack{\vec{S} \sim \mathcal{D}^{nT} \\ t \leftarrow \mathcal{A}(\vec{S})}}\left[\mathbb{1}\{t \neq \bot\} \cdot (f(\mathcal{D}^n) - f(S_t))\right]\right| \le (e^{\varepsilon} - e^{-\varepsilon}) \cdot \lambda n + 6\Delta n T.$$
If, in addition to (19), there exists a number $0 \le \tau \le \lambda$ s.t. for every $1 \le i \le n$ and every fixture of $S \in X^n$ we have that
$$\mathbb{E}_{y,z \sim \mathcal{D}}\left[\mathbb{1}\left\{\left|f(S^{(i \leftarrow y)}) - f\left(S^{(i \leftarrow z)}\right)\right| \le \lambda\right\} \cdot \left|f(S^{(i \leftarrow y)}) - f\left(S^{(i \leftarrow z)}\right)\right|\right] \le \tau, \quad (20)$$
then
$$\left|\mathbb{E}_{\substack{\vec{S} \sim \mathcal{D}^{nT} \\ t \leftarrow \mathcal{A}(\vec{S})}}\left[\mathbb{1}\{t \neq \bot\} \cdot (f(\mathcal{D}^n) - f(S_t))\right]\right| \le (e^{\varepsilon} - e^{-\varepsilon}) \cdot \tau n + 6\Delta n T.$$
We now present the proof assuming that (20) holds for some $0 \le \tau \le \lambda$. This is without loss of generality, as trivially it holds for $\tau = \lambda$.

Proof of Lemma A.1. Let $\vec{S}' = (S'_1, \ldots, S'_T) \sim \mathcal{D}^{nT}$ be independent of $\vec{S}$. Recall that each element $S_t$ of $\vec{S}$ is itself a vector $(x_{t,1}, \ldots, x_{t,n})$, and the same is true for each element $S'_t$ of $\vec{S}'$. We will sometimes refer to the vectors $S_1, \ldots, S_T$ as the subsamples of $\vec{S}$.

We define a sequence of intermediate samples that allow us to interpolate between $\vec{S}$ and $\vec{S}'$. Formally, for $\ell \in \{0, 1, \ldots, n\}$ define $\vec{S}^{\ell} = (S^{\ell}_1, \ldots, S^{\ell}_T) \in (X^n)^T$ where $S^{\ell}_t = (x^{\ell}_{t,1}, \ldots, x^{\ell}_{t,n})$ and
$$x^{\ell}_{t,i} = \begin{cases} x_{t,i}, & i > \ell \\ x'_{t,i}, & i \le \ell \end{cases}$$
That is, every subsample $S^{\ell}_t$ of $\vec{S}^{\ell}$ is identical to $S'_t$ on the first $\ell$ elements, and identical to $S_t$ thereafter. By construction we have $\vec{S}^0 = \vec{S}$ and $\vec{S}^n = \vec{S}'$. Moreover, for every $t$ we have that $S^{\ell}_t$ and $S^{\ell-1}_t$ differ in exactly one element.
In terms of these intermediate samples we can write:
$$\left|\mathbb{E}_{\vec{S} \sim \mathcal{D}^{nT}} \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S})}\left[\mathbb{1}\{t \neq \bot\} \cdot (f(\mathcal{D}^n) - f(S_t))\right]\right| = \left|\mathbb{E}_{\vec{S} \sim \mathcal{D}^{nT}} \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S})}\left[\mathbb{1}\{t \neq \bot\} \cdot \left(\mathbb{E}_{\vec{S}' \sim \mathcal{D}^{nT}}[f(S'_t)] - f(S_t)\right)\right]\right|$$
$$= \left|\mathbb{E}_{\vec{S} \sim \mathcal{D}^{nT}} \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S})} \mathbb{E}_{\vec{S}' \sim \mathcal{D}^{nT}}\left[\mathbb{1}\{t \neq \bot\} \cdot (f(S'_t) - f(S_t))\right]\right| = \left|\sum_{\ell \in [n]} \mathbb{E}_{\vec{S},\vec{S}' \sim \mathcal{D}^{nT}} \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S})}\left[\mathbb{1}\{t \neq \bot\} \cdot \left(f(S^{\ell}_t) - f(S^{\ell-1}_t)\right)\right]\right|$$
$$\le \sum_{\ell \in [n]} \left|\mathbb{E}_{\vec{S},\vec{S}' \sim \mathcal{D}^{nT}} \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S})}\left[\mathbb{1}\{t \neq \bot\} \cdot \left(f(S^{\ell}_t) - f(S^{\ell-1}_t)\right)\right]\right| = \sum_{\ell \in [n]} \left|\mathbb{E}_{\vec{S},\vec{S}' \sim \mathcal{D}^{nT}} \mathbb{E}_{Z \sim \mathcal{D}^T} \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S})}\left[\mathbb{1}\{t \neq \bot\} \cdot \left(f(S^{\ell}_t) - f(S^{\ell-1}_t)\right)\right]\right| \quad (21)$$
Given a multisample $\vec{S} = (S_1, \ldots, S_T) \in (X^n)^T$, a vector $Z = (z_1, \ldots, z_T) \in X^T$, and an index $1 \le k \le n$, we define $\vec{S}^{(k \leftarrow Z)}$ to be the same as $\vec{S}$ except that the $k$th element of every subsample $S_i$ is replaced with $z_i$. Observe that by construction, for every $\ell, Z$ we have $\vec{S}^{\ell,(\ell \leftarrow Z)} = \vec{S}^{\ell-1,(\ell \leftarrow Z)}$. Thus,
$$(21) = \sum_{\ell \in [n]} \left|\mathbb{E}_{\vec{S},\vec{S}' \sim \mathcal{D}^{nT}} \mathbb{E}_{Z \sim \mathcal{D}^T} \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S})}\left[\mathbb{1}\{t \neq \bot\} \cdot \left(f(S^{\ell}_t) - f\left(S^{\ell,(\ell \leftarrow Z)}_t\right)\right) - \mathbb{1}\{t \neq \bot\} \cdot \left(f(S^{\ell-1}_t) - f\left(S^{\ell-1,(\ell \leftarrow Z)}_t\right)\right)\right]\right|. \quad (22)$$
Observe that the pairs $(\vec{S}, \vec{S}^{\ell})$ and $\left(\vec{S}, \vec{S}^{\ell,(\ell \leftarrow Z)}\right)$ are identically distributed. Namely, both $\vec{S}^{\ell}$ and $\vec{S}^{\ell,(\ell \leftarrow Z)}$ agree with $\vec{S}$ on the last $(n - \ell)$ entries of every subsample, and otherwise contain i.i.d. samples from $\mathcal{D}$. Hence, the expectation of $\left(f(S^{\ell}_t) - f\left(S^{\ell,(\ell \leftarrow Z)}_t\right)\right)$ is zero, and we get
$$(22) = \sum_{\ell \in [n]} \left|\mathbb{E}_{\vec{S},\vec{S}' \sim \mathcal{D}^{nT}} \mathbb{E}_{Z \sim \mathcal{D}^T} \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S})}\left[\mathbb{1}\{t \neq \bot\} \cdot \left(f\left(S^{\ell-1,(\ell \leftarrow Z)}_t\right) - f(S^{\ell-1}_t)\right)\right]\right|. \quad (23)$$
Observe that the pair $(\vec{S}^{\ell-1}, \vec{S})$ has the same distribution as $(\vec{S}, \vec{S}^{\ell-1})$.
Specifically, the first component is $nT$ independent samples from $\mathcal{D}$ and the second component is equal to the first component with a subset of the entries replaced by fresh independent samples from $\mathcal{D}$. For brevity, for each $\ell$ let $G_{\ell}$ denote the event that both $\max_{m \in [T]} |f(S^{\ell-1}_m) - f(S^{\ell}_m)| \le \lambda$ and $\max_{m \in [T]} \left|f\left(S^{(\ell \leftarrow Z)}_m\right) - f(S_m)\right| \le \lambda$, and let $\bar{G}_{\ell}$ denote its complement. Thus,
$$(23) = \sum_{\ell \in [n]} \left|\mathbb{E}_{\vec{S},\vec{S}' \sim \mathcal{D}^{nT}} \mathbb{E}_{Z \sim \mathcal{D}^T} \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S}^{\ell-1})}\left[\mathbb{1}\{t \neq \bot\} \cdot \left(f\left(S^{(\ell \leftarrow Z)}_t\right) - f(S_t)\right)\right]\right|$$
$$\le \sum_{\ell \in [n]} \left|\mathbb{E}_{\vec{S},\vec{S}' \sim \mathcal{D}^{nT}} \mathbb{E}_{Z \sim \mathcal{D}^T}\left[\mathbb{1}\{G_{\ell}\} \cdot \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S}^{\ell-1})}\left[\mathbb{1}\{t \neq \bot\} \cdot \left(f\left(S^{(\ell \leftarrow Z)}_t\right) - f(S_t)\right)\right]\right]\right| + \sum_{\ell \in [n]} \left|\mathbb{E}_{\vec{S},\vec{S}' \sim \mathcal{D}^{nT}} \mathbb{E}_{Z \sim \mathcal{D}^T}\left[\mathbb{1}\{\bar{G}_{\ell}\} \cdot \max_{m \in [T]}\left|f\left(S^{(\ell \leftarrow Z)}_m\right) - f(S_m)\right|\right]\right| \quad (24)$$
When $\max_{m \in [T]} |f(S^{\ell-1}_m) - f(S^{\ell}_m)| \le \lambda$, the multi-databases $\vec{S}^{\ell-1}$ and $\vec{S}^{\ell}$ are $(f,\lambda)$-neighboring, and we can use the properties of algorithm $\mathcal{A}$ to argue that $\mathcal{A}(\vec{S}^{\ell-1}) \approx \mathcal{A}(\vec{S}^{\ell})$. By Claim 3.3 we get that
$$(24) \le \sum_{\ell \in [n]} \left|\mathbb{E}_{\vec{S},\vec{S}' \sim \mathcal{D}^{nT}} \mathbb{E}_{Z \sim \mathcal{D}^T}\left[\mathbb{1}\{G_{\ell}\} \cdot \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S}^{\ell})}\left[\mathbb{1}\{t \neq \bot\} \cdot \left(f\left(S^{(\ell \leftarrow Z)}_t\right) - f(S_t)\right)\right]\right]\right|$$
$$+ (e^{\varepsilon} - e^{-\varepsilon}) \cdot \sum_{\ell \in [n]} \left|\mathbb{E}_{\vec{S},\vec{S}' \sim \mathcal{D}^{nT}} \mathbb{E}_{Z \sim \mathcal{D}^T}\left[\mathbb{1}\{G_{\ell}\} \cdot \mathbb{E}_{t \leftarrow \mathcal{A}(\vec{S}^{\ell})}\left[\mathbb{1}\{t \neq \bot\} \cdot \left|f\left(S^{(\ell \leftarrow Z)}_t\right) - f(S_t)\right|\right]\right]\right|$$
$$+ \sum_{\ell \in [n]} \left|\mathbb{E}_{\vec{S},\vec{S}' \sim \mathcal{D}^{nT}} \mathbb{E}_{Z \sim \mathcal{D}^T}\left[\mathbb{1}\{\bar{G}_{\ell}\} \cdot \max_{m \in [T]}\left|f\left(S^{(\ell \leftarrow Z)}_m\right) - f(S_m)\right|\right]\right| \quad (25)$$
We can remove one of the two requirements in the indicator function in the middle row (this makes the expression bigger), to get:

$$
(25) \;\leq\; \big(\text{first row of }(25)\big) \;+\; (e^{\varepsilon}-e^{-\varepsilon})\cdot \sum_{\ell\in[n]} \left| \mathbb{E}_{\vec S,\vec S'\sim D^{nT}}\ \mathbb{E}_{Z\sim D^{T}}\ \mathbb{E}_{t\leftarrow A(\vec S^{\ell})} \left[ \mathbb{1}\Big\{ \max_{m\in[T]}\big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\big|\le\lambda \Big\} \cdot \mathbb{1}\{t\neq\bot\}\cdot \Big| f\big(S^{(\ell\leftarrow Z)}_{t}\big)-f(S_{t}) \Big| \right] \right| \;+\; \big(\text{third row of }(25)\big) \qquad (26)
$$
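The step from (25) to (26) can only enlarge the middle row, because its integrand is nonnegative and dropping a conjunct only grows an indicator:

$$
\mathbb{1}\{E_{1}\ \text{and}\ E_{2}\}\cdot W \;\le\; \mathbb{1}\{E_{2}\}\cdot W \qquad\text{for every } W\ge 0.
$$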
Furthermore, we can replace $\mathbb{1}\big\{\max_{m\in[T]}\big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\big|\le\lambda\big\}$ in the middle row with the weaker requirement that the bound holds just for the specific $t$ that was selected by algorithm $A$. This yields:

$$
(26) \;\leq\; \big(\text{first row of }(25)\big) \;+\; (e^{\varepsilon}-e^{-\varepsilon})\cdot \sum_{\ell\in[n]} \left| \mathbb{E}_{\vec S,\vec S'\sim D^{nT}}\ \mathbb{E}_{Z\sim D^{T}}\ \mathbb{E}_{t\leftarrow A(\vec S^{\ell})} \left[ \mathbb{1}\Big\{ \big| f\big(S^{(\ell\leftarrow Z)}_{t}\big)-f(S_{t}) \big|\le\lambda \Big\} \cdot \mathbb{1}\{t\neq\bot\}\cdot \Big| f\big(S^{(\ell\leftarrow Z)}_{t}\big)-f(S_{t}) \Big| \right] \right| \;+\; \big(\text{third row of }(25)\big) \qquad (27)
$$

Using the fact that the pairs $(\vec S,\vec S^{\ell})$ and $(\vec S^{\ell},\vec S)$ are identically distributed, we can switch them in the middle row, to get

$$
(27) \;\leq\; \big(\text{first row of }(25)\big) \;+\; (e^{\varepsilon}-e^{-\varepsilon})\cdot \sum_{\ell\in[n]} \left| \mathbb{E}_{\vec S\sim D^{nT}}\ \mathbb{E}_{t\leftarrow A(\vec S)}\ \mathbb{E}_{\vec S'\sim D^{nT},\,Z\sim D^{T}} \left[ \mathbb{1}\Big\{ \big| f\big(S^{\ell,(\ell\leftarrow Z)}_{t}\big)-f\big(S^{\ell}_{t}\big) \big|\le\lambda \Big\} \cdot \mathbb{1}\{t\neq\bot\}\cdot \Big| f\big(S^{\ell,(\ell\leftarrow Z)}_{t}\big)-f\big(S^{\ell}_{t}\big) \Big| \right] \right| \;+\; \big(\text{third row of }(25)\big) \qquad (28)
$$
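Concretely, the middle row of (28) is the quantity controlled by the second assumption on $f$ and $D$: conditioned on $\vec S$, on $t\neq\bot$, and on all entries other than the $\ell$-th, the databases $S^{\ell}_{t}$ and $S^{\ell,(\ell\leftarrow Z)}_{t}$ agree everywhere except coordinate $\ell$, where they take two independent fresh draws $y,z\sim D$. Hence

$$
\mathbb{E}_{\vec S',Z}\Big[ \mathbb{1}\big\{ \big|f\big(S^{\ell,(\ell\leftarrow Z)}_{t}\big)-f\big(S^{\ell}_{t}\big)\big|\le\lambda \big\} \cdot \big|f\big(S^{\ell,(\ell\leftarrow Z)}_{t}\big)-f\big(S^{\ell}_{t}\big)\big| \Big] \;\le\; \sup_{S\in X^{n}}\ \mathbb{E}_{y,z\sim D}\Big[ \mathbb{1}\big\{ \big|f\big(S^{(\ell\leftarrow y)}\big)-f\big(S^{(\ell\leftarrow z)}\big)\big|\le\lambda \big\} \cdot \big|f\big(S^{(\ell\leftarrow y)}\big)-f\big(S^{(\ell\leftarrow z)}\big)\big| \Big] \;\le\; \tau,
$$

and summing over $\ell\in[n]$ yields the $(e^{\varepsilon}-e^{-\varepsilon})\,n\tau$ term below.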
Using our assumptions on the function $f$ and the distribution $D$ (for the middle row) brings us to

$$
(28) \;\leq\; \big(\text{first row of }(25)\big) \;+\; (e^{\varepsilon}-e^{-\varepsilon})\,n\tau \;+\; \big(\text{third row of }(25)\big) \qquad (29)
$$

Our next task is to remove the indicator function in the first row. This is useful as the pairs $\big(\vec S^{\ell},\vec S^{(\ell\leftarrow Z)}\big)$ and $\big(\vec S^{\ell},\vec S\big)$ are identically distributed, and hence, if we were to remove the indicator function, the first row would be equal to zero. To that end we add and subtract the first row with the complementary indicator function (this amounts to multiplying the third row by 2). We get

$$
(29) \;\leq\; \sum_{\ell\in[n]} \left| \mathbb{E}_{\vec S,\vec S'\sim D^{nT}}\ \mathbb{E}_{Z\sim D^{T}}\ \mathbb{E}_{t\leftarrow A(\vec S^{\ell})} \Big[ \mathbb{1}\{t\neq\bot\}\cdot\Big( f\big(S^{(\ell\leftarrow Z)}_{t}\big)-f(S_{t}) \Big) \Big] \right| \;+\; (e^{\varepsilon}-e^{-\varepsilon})\,n\tau \;+\; 2\cdot\big(\text{third row of }(25)\big) \qquad (30)
$$

Now the first row is 0, so

$$
(30) \;=\; (e^{\varepsilon}-e^{-\varepsilon})\,n\tau \;+\; 2\cdot \sum_{\ell\in[n]} \left| \mathbb{E}_{\vec S,\vec S'\sim D^{nT}}\ \mathbb{E}_{Z\sim D^{T}} \left[ \mathbb{1}\Big\{ \max_{m\in[T]}\big|f(S^{\ell-1}_{m})-f(S^{\ell}_{m})\big|>\lambda \ \text{or}\ \max_{m\in[T]}\big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\big|>\lambda \Big\} \cdot \max_{m\in[T]}\Big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\Big| \right] \right| \qquad (31)
$$
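Both of the next two manipulations rest on elementary pointwise bounds, recorded here for reference:

$$
\mathbb{1}\{A\ \text{or}\ B\} \;\le\; \mathbb{1}\{A\}+\mathbb{1}\{B\}, \qquad\text{and}\qquad \max_{m\in[T]} a_{m} \;\le\; \sum_{m\in[T]} a_{m} \quad (a_{m}\ge 0),
$$

both applied under the nonnegative integrands below.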
We can replace the or condition in the indicator function with the sum of the two conditions:

$$
(31) \;\leq\; (e^{\varepsilon}-e^{-\varepsilon})\,n\tau \;+\; 2\cdot \sum_{\ell\in[n]} \left| \mathbb{E}_{\vec S,\vec S'\sim D^{nT}}\ \mathbb{E}_{Z\sim D^{T}} \left[ \mathbb{1}\Big\{ \max_{m\in[T]}\big|f(S^{\ell-1}_{m})-f(S^{\ell}_{m})\big|>\lambda \Big\} \cdot \max_{m\in[T]}\Big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\Big| \right] \right|
$$
$$
+\; 2\cdot \sum_{\ell\in[n]} \left| \mathbb{E}_{\vec S,\vec S'\sim D^{nT}}\ \mathbb{E}_{Z\sim D^{T}} \left[ \mathbb{1}\Big\{ \max_{m\in[T]}\big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\big|>\lambda \Big\} \cdot \max_{m\in[T]}\Big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\Big| \right] \right| \qquad (32)
$$

In the third row, we can replace $\max_{m\in[T]}$ with $\sum_{m\in[T]}$, to get

$$
(32) \;\leq\; (e^{\varepsilon}-e^{-\varepsilon})\,n\tau \;+\; 2\cdot \sum_{\ell\in[n]} \left| \mathbb{E}_{\vec S,\vec S'\sim D^{nT}}\ \mathbb{E}_{Z\sim D^{T}} \left[ \mathbb{1}\Big\{ \max_{m\in[T]}\big|f(S^{\ell-1}_{m})-f(S^{\ell}_{m})\big|>\lambda \Big\} \cdot \max_{m\in[T]}\Big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\Big| \right] \right|
$$
$$
+\; 2\cdot \sum_{\ell\in[n]}\sum_{m\in[T]} \left| \mathbb{E}_{\vec S,\vec S'\sim D^{nT}}\ \mathbb{E}_{Z\sim D^{T}} \left[ \mathbb{1}\Big\{ \big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\big|>\lambda \Big\} \cdot \Big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\Big| \right] \right| \qquad (33)
$$

Applying our assumptions on $f$ and $D$ to the third row brings us to

$$
(33) \;\leq\; (e^{\varepsilon}-e^{-\varepsilon})\,n\tau \;+\; 2nT\Delta \;+\; 2\cdot \sum_{\ell\in[n]} \left| \mathbb{E}_{\vec S,\vec S'\sim D^{nT}}\ \mathbb{E}_{Z\sim D^{T}} \left[ \mathbb{1}\Big\{ \max_{m\in[T]}\big|f(S^{\ell-1}_{m})-f(S^{\ell}_{m})\big|>\lambda \Big\} \cdot \max_{m\in[T]}\Big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\Big| \right] \right| \qquad (34)
$$
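To spell out where the $2nT\Delta$ term comes from: each $(\ell,m)$ summand in the third row of (33) is precisely the quantity controlled by the first assumption, instantiated with $i=\ell$, since $S_{m}\sim D^{n}$ and $S^{(\ell\leftarrow Z)}_{m}=S^{(\ell\leftarrow Z_{m})}_{m}$ with $Z_{m}\sim D$ a fresh sample:

$$
\mathbb{E}_{\vec S,Z}\Big[ \mathbb{1}\big\{ \big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\big|>\lambda \big\} \cdot \big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\big| \Big] \;\le\; \Delta.
$$

The double sum over $\ell\in[n]$ and $m\in[T]$, together with the factor 2 carried from (31), thus contributes at most $2nT\Delta$.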
The issue now is that the expression inside the indicator function is different from the expression outside of it. To that end, we split the indicator function as follows:

$$
(34) \;\leq\; (e^{\varepsilon}-e^{-\varepsilon})\,n\tau \;+\; 2nT\Delta \;+\; 2\cdot \sum_{\ell\in[n]} \left| \mathbb{E}_{\vec S,\vec S'\sim D^{nT}}\ \mathbb{E}_{Z\sim D^{T}} \left[ \mathbb{1}\Big\{ \max_{m\in[T]}\big|f(S^{\ell-1}_{m})-f(S^{\ell}_{m})\big|>\lambda \ \text{and}\ \max_{m\in[T]}\big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\big|>\lambda \Big\} \cdot \max_{m\in[T]}\Big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\Big| \right] \right|
$$
$$
+\; 2\cdot \sum_{\ell\in[n]} \left| \mathbb{E}_{\vec S,\vec S'\sim D^{nT}}\ \mathbb{E}_{Z\sim D^{T}} \left[ \mathbb{1}\Big\{ \max_{m\in[T]}\big|f(S^{\ell-1}_{m})-f(S^{\ell}_{m})\big|>\lambda \ \text{and}\ \max_{m\in[T]}\big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\big|\le\lambda \Big\} \cdot \max_{m\in[T]}\Big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\Big| \right] \right|
$$
$$
\leq\; (e^{\varepsilon}-e^{-\varepsilon})\,n\tau \;+\; 2nT\Delta \;+\; 2\cdot \sum_{\ell\in[n]} \left| \mathbb{E}_{\vec S,\vec S'\sim D^{nT}}\ \mathbb{E}_{Z\sim D^{T}} \left[ \mathbb{1}\Big\{ \max_{m\in[T]}\big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\big|>\lambda \Big\} \cdot \max_{m\in[T]}\Big|f\big(S^{(\ell\leftarrow Z)}_{m}\big)-f(S_{m})\Big| \right] \right| \;+\; 2\cdot \sum_{\ell\in[n]} \left| \mathbb{E}_{\vec S,\vec S'\sim D^{nT}}\ \mathbb{E}_{Z\sim D^{T}} \left[ \mathbb{1}\Big\{ \max_{m\in[T]}\big|f(S^{\ell-1}_{m})-f(S^{\ell}_{m})\big|>\lambda \Big\} \cdot \max_{m\in[T]}\big|f(S^{\ell-1}_{m})-f(S^{\ell}_{m})\big| \right] \right|
$$
$$
\leq\; (e^{\varepsilon}-e^{-\varepsilon})\,n\tau \;+\; 6nT\Delta,
$$

where the last inequality applies, to each of the two remaining sums, the same max-to-sum replacement followed by our assumptions on $f$ and $D$, contributing $2nT\Delta$ each.
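To build intuition for the quantities $\Delta$, $\lambda$, $\tau$ that drive these bounds, the following Python snippet estimates the two truncated-difference expectations by Monte Carlo. It is an illustration of ours, not an example from the paper: the choice $f(S)=\max(S)$ with $D=\mathrm{Exp}(1)$ is an assumption, picked because the maximum has unbounded worst-case sensitivity yet small typical truncated differences. Note also that the paper's second condition is a worst case over $S$, whereas this snippet only averages over random $S$.

import numpy as np

# Illustration only: Monte-Carlo estimates of the truncated-difference
# quantities Delta and tau for the (hypothetical) choice f(S) = max(S)
# with D = Exponential(1). The paper's tau is a sup over all S; averaging
# over random S here gives only a rough feel for the quantity.
rng = np.random.default_rng(0)
n, lam, trials = 100, 2.0, 20000

def f(S):
    return S.max()

delta_terms, tau_terms = [], []
for _ in range(trials):
    S = rng.exponential(size=n)
    y, z = rng.exponential(size=2)
    i = 0  # by symmetry of i.i.d. coordinates, any fixed index works
    Sy, Sz = S.copy(), S.copy()
    Sy[i], Sz[i] = y, z
    d1 = abs(f(S) - f(Sz))    # |f(S) - f(S^(i<-z))|
    d2 = abs(f(Sy) - f(Sz))   # |f(S^(i<-y)) - f(S^(i<-z))|
    delta_terms.append(d1 * (d1 > lam))   # large differences: Delta term
    tau_terms.append(d2 * (d2 <= lam))    # small differences: tau term

print("Delta estimate:", np.mean(delta_terms))
print("tau estimate:  ", np.mean(tau_terms))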
A.2 Multi Sample Amplification

Theorem A.2 (High Probability Bound). Let $D$ be a distribution over a domain $X$, let $f:X^{n}\rightarrow\mathbb{R}$, and let $\Delta,\lambda,\tau$ be s.t. for every $1\le i\le n$ it holds that

$$
\mathbb{E}_{S\sim D^{n},\,z\sim D}\left[ \mathbb{1}\Big\{ \big|f(S)-f\big(S^{(i\leftarrow z)}\big)\big|>\lambda \Big\} \cdot \big|f(S)-f\big(S^{(i\leftarrow z)}\big)\big| \right] \;\le\; \Delta,
$$

and, furthermore, $\forall S\in X^{n}$ and $\forall\, 1\le i\le n$ we have

$$
\mathbb{E}_{y,z\sim D}\left[ \mathbb{1}\Big\{ \big|f\big(S^{(i\leftarrow y)}\big)-f\big(S^{(i\leftarrow z)}\big)\big|\le\lambda \Big\} \cdot \big|f\big(S^{(i\leftarrow y)}\big)-f\big(S^{(i\leftarrow z)}\big)\big| \right] \;\le\; \tau,
$$

where $S^{(i\leftarrow z)}$ is the same as $S$ except that the $i$th element is replaced with $z$. Then for every $\varepsilon>0$ we have that

$$
\Pr_{S\sim D^{n}}\Big[ \big|f(S)-f(D^{n})\big| \ge (e^{\varepsilon}-e^{-\varepsilon})\tau n \Big] \;<\; \frac{\Delta}{(e^{\varepsilon}-e^{-\varepsilon})\tau},
$$

provided that $n \ge O\Big( \frac{\lambda}{\varepsilon\,(e^{\varepsilon}-e^{-\varepsilon})\tau} \log\Big( \frac{(e^{\varepsilon}-e^{-\varepsilon})\tau}{\Delta} \Big) \Big)$.

Proof. We only analyze the probability that $(f(S)-f(D^{n}))$ is large. The analysis for $(f(D^{n})-f(S))$ is symmetric. Assume towards contradiction that with probability at least $\frac{\Delta}{(e^{\varepsilon}-e^{-\varepsilon})\tau}$ we have that $f(S)-f(D^{n}) \ge (e^{\varepsilon}-e^{-\varepsilon})\tau n$. We now construct the following algorithm $B$ that contradicts our expectation bound.

Algorithm 2: $B$

Input: $T$ databases of size $n$ each: $\vec S=(S_{1},\dots,S_{T})$, where $T \triangleq \big\lfloor \frac{(e^{\varepsilon}-e^{-\varepsilon})\tau}{\Delta} \big\rfloor$.

1. Set $H=\{\bot,1,2,\dots,T\}$.
2. For $i=1,\dots,T$, define $q(\vec S,i)=f(S_{i})-f(D^{n})$. Also set $q(\vec S,\bot)=0$.
3. Sample $t^{*}\in H$ with probability proportional to $\exp\big(\frac{\varepsilon}{2\lambda}\, q(\vec S,t)\big)$.

Output: $t^{*}$.

The fact that algorithm $B$ is $(\varepsilon,(f,\lambda))$-differentially private follows from the standard analysis of the Exponential Mechanism of McSherry and Talwar [14]. The proof appears in Claim A.4 for completeness.

Now consider applying $B$ on databases $\vec S=(S_{1},\dots,S_{T})$ containing i.i.d. samples from $D$. By our assumption on $D$ and $f$, for every $t$ we have that $f(S_{t})-f(D^{n}) \ge (e^{\varepsilon}-e^{-\varepsilon})\tau n$ with probability at least $\frac{\Delta}{(e^{\varepsilon}-e^{-\varepsilon})\tau}$. By our choice of $T = \big\lfloor \frac{(e^{\varepsilon}-e^{-\varepsilon})\tau}{\Delta} \big\rfloor$, we therefore get

$$
\Pr_{\vec S\sim D^{nT}}\left[ \max_{t\in[T]} \big\{ f(S_{t})-f(D^{n}) \big\} \ge (e^{\varepsilon}-e^{-\varepsilon})\tau n \right] \;\ge\; 1-\left( 1-\frac{\Delta}{(e^{\varepsilon}-e^{-\varepsilon})\tau} \right)^{T} \;\ge\; \frac{1}{2}.
$$

The probability is taken over the random choice of the examples in $\vec S$ according to $D$. Thus, by Markov's inequality,

$$
\mathbb{E}_{\vec S\sim D^{nT}}\left[ \max_{t\in H}\, q(\vec S,t) \right] \;=\; \mathbb{E}_{\vec S\sim D^{nT}}\left[ \max\Big\{ 0,\ \max_{t\in[T]}\big( f(S_{t})-f(D^{n}) \big) \Big\} \right] \;\ge\; \frac{1}{2}\,(e^{\varepsilon}-e^{-\varepsilon})\tau n. \qquad (35)
$$

So, in expectation, $\max_{t\in H}\, q(\vec S,t)$ is large. In order to contradict our expectation bound, we need to show that this is also the case for the index $t^{*}$ that is sampled in Step 3. To that end, we now use the following technical claim, stating that the expected quality of a solution sampled as in Step 3 is high.

Claim A.3 (e.g., [2]). Let $H$ be a finite set, $h:H\rightarrow\mathbb{R}$ a function, and $\eta>0$. Define a random variable $Y$ on $H$ by $\Pr[Y=y]=\exp(\eta\, h(y))/C$, where $C=\sum_{y\in H}\exp(\eta\, h(y))$. Then $\mathbb{E}[h(Y)] \ge \max_{y\in H} h(y) - \frac{1}{\eta}\ln|H|$.
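Step 3 of Algorithm $B$ is an instance of the exponential mechanism, and Claim A.3 quantifies its utility. The following minimal Python sketch (ours, for illustration only) samples as in Step 3 and numerically checks the guarantee of Claim A.3. Since Algorithm $B$ is a thought experiment whose scores presume access to $f(D^{n})$, the scores below are synthetic stand-ins for $q(\vec S,\cdot)$; all parameter choices are assumptions.

import numpy as np

# Sketch of Step 3 of Algorithm B plus a numerical check of Claim A.3:
# sampling t* with probability proportional to exp(eta * h(t*)) loses at
# most (1/eta) * ln|H| in expectation relative to max h.
rng = np.random.default_rng(1)
T, eps, lam = 50, 0.5, 1.0
eta = eps / (2 * lam)                       # the exponent used in Step 3

q = np.concatenate(([0.0], rng.normal(0, 5, size=T)))  # index 0 plays the role of ⊥
logits = eta * q
probs = np.exp(logits - logits.max())       # numerically stable softmax
probs /= probs.sum()

expected_quality = (probs * q).sum()        # E[h(Y)]
guarantee = q.max() - np.log(len(q)) / eta  # max h - (1/eta) ln|H|
assert expected_quality >= guarantee - 1e-9
print(expected_quality, ">=", guarantee)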
For every fixing of $\vec S$, we can apply Claim A.3 with $h(t)=q(\vec S,t)$ and $\eta=\frac{\varepsilon}{2\lambda}$ to get

$$
\mathbb{E}_{t^{*}\in_{R} H}\big[ q(\vec S,t^{*}) \big] \;=\; \mathbb{E}_{t^{*}\in_{R} H}\Big[ \mathbb{1}\{t^{*}\neq\bot\}\cdot\big( f(S_{t^{*}})-f(D^{n}) \big) \Big] \;\ge\; \max\Big\{ 0,\ \max_{t\in[T]}\big( f(S_{t})-f(D^{n}) \big) \Big\} \;-\; \frac{2\lambda}{\varepsilon}\ln(T+1).
$$

Taking the expectation also over $\vec S\sim D^{nT}$ we get that

$$
\mathbb{E}_{\vec S\sim D^{nT},\ t^{*}\leftarrow B(\vec S)}\Big[ \mathbb{1}\{t^{*}\neq\bot\}\cdot\big( f(S_{t^{*}})-f(D^{n}) \big) \Big] \;\ge\; \mathbb{E}_{\vec S\sim D^{nT}}\left[ \max\Big\{ 0,\ \max_{t\in[T]}\big( f(S_{t})-f(D^{n}) \big) \Big\} \right] \;-\; \frac{2\lambda}{\varepsilon}\ln(T+1) \;\ge\; \frac{1}{2}\,(e^{\varepsilon}-e^{-\varepsilon})\tau n \;-\; \frac{2\lambda}{\varepsilon}\ln(T+1).
$$

This contradicts our expectation bound whenever $n > \frac{\lambda}{\varepsilon\,(e^{\varepsilon}-e^{-\varepsilon})\tau}\ln(T+1) = \frac{\lambda}{\varepsilon\,(e^{\varepsilon}-e^{-\varepsilon})\tau}\ln\Big( \frac{(e^{\varepsilon}-e^{-\varepsilon})\tau}{\Delta}+1 \Big)$.
Claim A.4. Algorithm $B$ is $(\varepsilon,(f,\lambda))$-differentially private.

Proof. Fix two $(f,\lambda)$-neighboring databases $\vec S$ and $\vec S'$, and let $b\in\{\bot,1,2,\dots,T\}$ be a possible output. We have that

$$
\Pr[B(\vec S)=b] \;=\; \frac{\exp\big( \frac{\varepsilon}{2\lambda}\cdot q(\vec S,b) \big)}{\sum_{a\in H}\exp\big( \frac{\varepsilon}{2\lambda}\cdot q(\vec S,a) \big)} \qquad (36)
$$

Using the fact that $\vec S$ and $\vec S'$ are $(f,\lambda)$-neighboring, for every $a\in H$ we get that $q(\vec S',a)-\lambda \le q(\vec S,a) \le q(\vec S',a)+\lambda$. Hence,

$$
(36) \;\le\; \frac{\exp\big( \frac{\varepsilon}{2\lambda}\cdot[q(\vec S',b)+\lambda] \big)}{\sum_{a\in H}\exp\big( \frac{\varepsilon}{2\lambda}\cdot[q(\vec S',a)-\lambda] \big)} \;=\; \frac{e^{\varepsilon/2}\cdot\exp\big( \frac{\varepsilon}{2\lambda}\cdot q(\vec S',b) \big)}{e^{-\varepsilon/2}\sum_{a\in H}\exp\big( \frac{\varepsilon}{2\lambda}\cdot q(\vec S',a) \big)} \;=\; e^{\varepsilon}\cdot\Pr[B(\vec S')=b].
$$

For any possible set of outputs $F\subseteq\{\bot,1,2,\dots,T\}$ we now have that

$$
\Pr[B(\vec S)\in F] \;=\; \sum_{b\in F}\Pr[B(\vec S)=b] \;\le\; \sum_{b\in F} e^{\varepsilon}\cdot\Pr[B(\vec S')=b] \;=\; e^{\varepsilon}\cdot\Pr[B(\vec S')\in F].
$$
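As a quick numerical companion to Claim A.4 (again an illustration of ours, with synthetic scores): if two score vectors differ by at most $\lambda$ in every coordinate, the Step 3 distributions with exponent $\frac{\varepsilon}{2\lambda}$ differ pointwise by a factor of at most $e^{\varepsilon}$.

import numpy as np

# Numeric spot-check of Claim A.4: coordinate-wise lam-close score vectors
# induce exponential-mechanism distributions within a pointwise e^eps factor.
rng = np.random.default_rng(2)
T, eps, lam = 40, 0.3, 1.0

def mech_probs(q):
    logits = (eps / (2 * lam)) * q
    p = np.exp(logits - logits.max())   # numerically stable softmax
    return p / p.sum()

q = rng.normal(0, 3, size=T + 1)                 # scores on H = {⊥} ∪ [T]
q2 = q + rng.uniform(-lam, lam, size=T + 1)      # (f, lam)-neighboring scores

ratio = mech_probs(q) / mech_probs(q2)
assert ratio.max() <= np.exp(eps) + 1e-9
print("max probability ratio:", ratio.max(), "<= e^eps =", np.exp(eps))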