A workload-adaptive mechanism for linear queries under local differential privacy
Ryan McKenna Raj Kumar Maity Arya Mazumdar Gerome Miklau
College of Information and Computer Sciences, University of Massachusetts Amherst
{rmckenna, rajkmaity, arya, miklau}@cs.umass.edu

ABSTRACT
We propose a new mechanism to accurately answer a user-provided set of linear counting queries under local differential privacy (LDP). Given a set of linear counting queries (the workload), our mechanism automatically adapts to provide accuracy on the workload queries. We define a parametric class of mechanisms that produce unbiased estimates of the workload, and formulate a constrained optimization problem to select a mechanism from this class that minimizes expected total squared error. We solve this optimization problem numerically using projected gradient descent and provide an efficient implementation that scales to large workloads. We demonstrate the effectiveness of our optimization-based approach in a wide variety of settings, showing that it outperforms many competitors, even outperforming existing mechanisms on the workloads for which they were intended.
1. INTRODUCTION
In recent years, Differential Privacy [15] has emerged as the dominant approach to privacy, and its adoption in practical settings is growing. Differential privacy is achieved with carefully designed randomized algorithms, called mechanisms. The aim of these mechanisms is to extract utility from the data while adhering to the constraints imposed by differential privacy. Utility is measured on a task-by-task basis, and different tasks require different mechanisms. Utility-optimal mechanisms, or mechanisms that maximize utility for a given task under a fixed privacy budget, are still not known in many cases.

There are two main models of differential privacy: the central model and the local model. In the central model, users provide their data to a trusted data curator, who runs a privacy mechanism on the dataset in its entirety. In the local model, users execute a privacy mechanism on their data before sending it to the data curator. Local differential privacy (LDP) offers a stronger privacy guarantee than central differential privacy, as it does not rely on the assumption of a trusted data curator. For that reason, it has been embraced by several organizations like Google [18], Apple [38], and Microsoft [14] for the collection of personal data from customers. While the stronger privacy guarantee is a benefit of the local model, it necessarily leads to greater error than the central model [16], which makes error-optimal mechanisms an important goal.

Our focus is answering a workload of linear counting queries under local differential privacy. Answering a query workload is a general task that subsumes other common tasks, like estimating histograms, range queries, and marginals. Furthermore, the expressivity of linear query workloads goes far beyond these special cases, as it can include an arbitrary set of predicate counting queries. By defining the workload, the analyst expresses the exact queries they care about most, and their relative importance.
There are several LDP mechanisms for answering particular fixed workloads, like histograms [1, 45, 5, 41], range queries [13, 42], and marginals [12, 42]. These mechanisms were carefully crafted to provide accuracy on the workloads for which they were designed, but their accuracy properties typically do not transfer to other workloads. Some LDP mechanisms are designed to answer an arbitrary collection of linear queries [4, 17], but they do not outperform simple baselines in practice.

In this paper, we propose a new mechanism that automatically adapts in order to prioritize accuracy on a target workload. Adaptation to the workload is accomplished by solving a numerical optimization problem, in which we search over an expressive class of unbiased LDP mechanisms for one that minimizes variance on the workload queries.

Workload-adaptation [21, 27, 33] is a much more developed topic in the central model of differential privacy and has led to mechanisms that offer best-in-class error rates in some settings [22]. Our work is conceptually similar to the Matrix Mechanism [27, 33], which also minimizes variance over a class of unbiased mechanisms. However, because the class of mechanisms we consider is different, the optimization problem is fundamentally different and requires a novel analysis and algorithmic solution. We thoroughly discuss the similarities and differences between these two mechanisms in Section 7.
Contributions.
The paper consists of four main technical contributions.

• We propose a new class of mechanisms, called workload factorization mechanisms, that generalizes many existing LDP mechanisms, and we formulate an optimization problem to select a mechanism from this class that is optimally tailored to a workload (Section 3).

• We give an efficient algorithm to approximately solve this optimization problem, by reformulating it into an algorithmically tractable form (Section 4).

• We provide a theoretical analysis which illuminates error properties of the mechanism and justifies the design choices we made (Section 5).

• In a comprehensive set of experiments, we test our mechanism on a range of workloads, showing that it consistently delivers lower error than competitors, by as much as a factor of 14.

2. BACKGROUND AND PROBLEM SETUP

In this section we introduce notation for the input data and query workload, as well as provide basic definitions of local differential privacy. A full review of notation is provided in Table 2 of the Appendix.
Given a domain U of n distinct user types, the input data is a collection of N users ⟨u_1, ..., u_N⟩, where each u_i ∈ U. We commonly use a vector representation of the input data, containing a count for each possible user type:

Definition 2.1 (Data Vector)
The data vector, denoted by x, is an n-length vector of counts indexed by user types u ∈ U such that:

x_u = Σ_{j=1}^{N} 1[u_j = u]  for all u ∈ U.

In the local model, we do not have direct access to x, but it is still useful to define it for the purpose of analysis. Below is a simple data vector one might obtain from a data set of student grades.

Example 2.2 (Student Data)
Consider a data set of student grades, where U = {A, B, C, D, F}. Suppose 10 students got an A, 20 students got a B, 5 students got a C, and no students got a D or F. Then the data vector would be:

x = [10 20 5 0 0]^T

Linear counting queries have a similar vector representation, as shown in Definition 2.3.
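As a concrete sketch (plain Python; the grade counts are those of Example 2.2), the data vector is simply a histogram of the raw records over the domain:

```python
from collections import Counter

# Build the data vector x of Example 2.2 from raw user records.
domain = ["A", "B", "C", "D", "F"]
users = ["A"] * 10 + ["B"] * 20 + ["C"] * 5  # 35 users total
counts = Counter(users)

# One count per user type, in domain order.
x = [counts[u] for u in domain]
print(x)  # [10, 20, 5, 0, 0]
```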
Definition 2.3 (Linear Query)
A linear counting query is an n-length vector w indexed by user types u ∈ U, such that the answer to the query is w^T x = Σ_{u∈U} w_u x_u. A workload is a collection of p linear queries w_1, ..., w_p ∈ R^n organized into a p × n matrix W. Our goal is to accurately estimate answers to each workload query under local differential privacy, i.e., we want to privately estimate Wx. The most commonly studied workload is the so-called Histogram workload, which is represented by an n × n identity matrix. A more interesting workload is given below:

Example 2.4 (Prefix workload)
The prefix workload contains queries that compute the (unnormalized) empirical cumulative distribution function of the data, or the number of students that have grades ≥ A, ≥ B, ≥ C, ≥ D, ≥ F.

W = [ 1 0 0 0 0
      1 1 0 0 0
      1 1 1 0 0
      1 1 1 1 0
      1 1 1 1 1 ]

The workload W is an input to our algorithm, reflecting the queries of interest to the analyst and therefore determining the measure of utility that will be used to assess algorithm performance. In this setup, we make no assumptions about the structure or contents of W, and allow it to be completely arbitrary, even including the same query multiple times or multiple linearly dependent queries.

Local differential privacy (LDP) [16] is a property of a randomized mechanism M that acts on user data. Instead of reporting their true user type, users instead report a randomized response obtained by executing M on their true input. These randomized responses allow an analyst to learn something about the population as a whole, while still providing the individual users a form of plausible deniability about their true input. The formal requirement on M is stated in Definition 2.5.

Definition 2.5 (Local Differential Privacy)
A randomized mechanism M : U → O is said to be ε-LDP if and only if, for all u, u' ∈ U and all S ⊆ O:

Pr[M(u) ∈ S] ≤ exp(ε) Pr[M(u') ∈ S]

The output range O can vary between mechanisms. In some simple cases, it is the same as the input domain U, but it does not have to be. Choosing the output range is typically the first step in designing a mechanism. When the range of the mechanism is finite, i.e., |O| = m, we can completely specify a mechanism by a so-called strategy matrix Q ∈ R^{m×n}, indexed by (o, u) ∈ O × U. The mechanism M_Q(u) is then defined by:

Pr[M_Q(u) = o] = Q_{o,u}

This encoding of a mechanism essentially stores a probability for every possible (input, output) pair in the strategy matrix Q. We translate Definition 2.5 to strategy matrices in Proposition 2.6.

Proposition 2.6 (Strategy Matrix)
The mechanism M_Q is ε-LDP if and only if the following conditions are satisfied:

1. Q_{o,u} ≤ exp(ε) Q_{o,u'} for all o, u, u'
2. Q_{o,u} ≥ 0 for all o, u, and Σ_o Q_{o,u} = 1 for all u.

Above, the first condition is the privacy constraint, ensuring that the output distributions for any two users are close, and the second is the probability simplex constraint, ensuring that each column of Q corresponds to a valid probability distribution. Representing mechanisms as matrices is useful because it allows us to reason about them mathematically with linear algebra [26, 24]. Example 2.7 shows how a simple mechanism, called randomized response, can be encoded as a strategy matrix.

Example 2.7 (Randomized Response)
The randomized response mechanism¹ [44] can be encoded as a strategy matrix in the following way:

Q = 1/(e^ε + n − 1) ·
    [ e^ε  1    ...  1
      1    e^ε  ...  1
      ...  ...  ...  ...
      1    1    ...  e^ε ]

For this mechanism, the output range is the same as the input domain, and hence the strategy matrix is square. The diagonal entries of the strategy matrix are proportional to e^ε, and the off-diagonal entries are proportional to 1. This means that each user reports their true input with probability proportional to e^ε and all other possible outputs with probability proportional to 1. It is easy to see that the conditions of Proposition 2.6 are satisfied. While this is one of the simplest mechanisms, many other mechanisms can also be represented in this way. For example, Table 1 shows how RAPPOR [18], Subset Selection [45] and Hadamard [2] can be expressed as a strategy matrix. Other mechanisms with more sophisticated structure, such as Hierarchical [13, 42] and Fourier [12], can also be expressed as a strategy matrix, but they require too much notation to explain in Table 1.

¹The name of this mechanism should not be confused with the outputs of an arbitrary mechanism M, which we also call randomized responses.

Mechanism | Input | Output | Strategy Matrix
Randomized Response [44] | u ∈ [n] | o ∈ [n] | Q_{o,u} ∝ exp(ε) if o = u; 1 if o ≠ u
RAPPOR [18] | u ∈ [n] | o ∈ {0, 1}^n | Q_{o,u} ∝ exp(ε/2)^{n − ||o − e_u||_1}
Hadamard [1] | u ∈ [n] | o ∈ [K], K = 2^{⌈log_2(n+1)⌉} | Q_{o,u} ∝ exp(ε) if H_{o+1,u} = 1; 1 if H_{o+1,u} = −1
Subset Selection [45] | u ∈ [n] | o ∈ {0, 1}^n, ||o||_1 = d | Q_{o,u} ∝ exp(ε) if o_u = 1; 1 if o_u = 0

Table 1: Existing LDP mechanisms encoded as a strategy matrix. e_u is the one-hot encoding of u, H is the K × K Hadamard matrix, and d is a hyper-parameter.
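To make the conditions of Proposition 2.6 concrete, here is a small NumPy sketch (the helper names are ours, not from the paper) that builds the randomized response strategy matrix of Example 2.7 and checks both conditions:

```python
import numpy as np

def randomized_response(n, eps):
    # Diagonal entries proportional to e^eps, off-diagonal entries to 1;
    # the normalizer makes every column sum to 1.
    Q = np.ones((n, n)) + (np.exp(eps) - 1.0) * np.eye(n)
    return Q / (np.exp(eps) + n - 1.0)

def is_ldp_strategy(Q, eps, tol=1e-9):
    nonneg = np.all(Q >= -tol)                       # Condition 2: Q_{o,u} >= 0
    simplex = np.allclose(Q.sum(axis=0), 1.0)        # Condition 2: columns sum to 1
    # Condition 1: within each row, the largest entry is at most
    # exp(eps) times the smallest entry.
    private = np.all(Q.max(axis=1) <= np.exp(eps) * Q.min(axis=1) + tol)
    return bool(nonneg and simplex and private)

Q = randomized_response(5, 1.0)
print(is_ldp_strategy(Q, 1.0))  # True
```

Checking the same matrix against a smaller budget, e.g. `is_ldp_strategy(Q, 0.5)`, returns `False`, since the row-wise ratio is exactly e^1.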
A strategy matrix is simply a direct encoding of a conditional probability distribution, where every probability is explicitly enumerated as an entry of the matrix. Hence, the representation can encode any LDP mechanism, as long as U and O are both finite and the conditional probabilities can be calculated.

When executing the mechanism, each user reports a (randomized) response o_i = M_Q(u_i). When all users randomize their data with the same mechanism, these responses are typically aggregated into a response vector y ∈ R^m (indexed by elements o ∈ O), where y_o = Σ_{j=1}^{N} 1[o_j = o]. Much like the data vector, the response vector is essentially a histogram of responses, as it counts the number of users who reported each response. In the sequel, it is useful to think of the mechanism M_Q as being a function from x to y instead of u_i to o_i. Thus, for notational convenience, we overload the definition of M_Q, allowing it to consume a data vector and return a response vector, so that M_Q(x) = y.

The response vector y is often not that useful by itself, but it can be used to estimate more useful quantities, such as the data vector x or the workload answers Wx. This is typically achieved with a post-processing step, and does not impact the privacy guarantee of the mechanism.
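The mapping from x to y can be simulated directly: each of the x_u users of type u draws one output from column u of Q. A NumPy sketch (randomized response strategy; illustrative only):

```python
import numpy as np

def run_mechanism(Q, x, rng):
    # Aggregate response vector y = M_Q(x): the x[u] users of type u each
    # sample one output o with probability Q[o, u].
    m, n = Q.shape
    y = np.zeros(m)
    for u in range(n):
        y += rng.multinomial(int(x[u]), Q[:, u])
    return y

eps, n = 1.0, 5
Q = (np.ones((n, n)) + (np.exp(eps) - 1.0) * np.eye(n)) / (np.exp(eps) + n - 1.0)
x = np.array([10, 20, 5, 0, 0])   # data vector from Example 2.2

y = run_mechanism(Q, x, np.random.default_rng(0))
print(y.sum())  # 35.0 — exactly one response per user
```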
3. THE FACTORIZATION MECHANISM
In this section, we describe our mechanism and the mainoptimization problem that underlies it. We begin with a high-level problem statement, and reason about it analyticallyuntil it is in a form we can deal with algorithmically. Wepresent our key ideas and the main steps of our derivation,but defer the finer details of the proofs to Appendix B.Our goal is to find a mechanism that has low expected erroron the workload. This objective is formalized in Problem 3.1.
Problem 3.1 (Workload Error Minimization)
Given a workload W, design an ε-LDP mechanism M* that minimizes worst-case expected L_2 squared error. Formally,

M* = argmin_M { max_x E[ ||Wx − M(x)||_2^2 ] }

In the problem statement above, our goal is to search through the space of all ε-LDP mechanisms for the one that is best for the given workload. Because it is difficult to characterize an arbitrary mechanism M in a way that makes optimization possible, we do not solve the above problem in its full generality. Instead, we perform the search over a restricted class of mechanisms which is easier to characterize. While somewhat restricted, this class of mechanisms is quite expressive, and it captures many of the state-of-the-art LDP mechanisms available today [18, 45, 2, 12, 13, 42].

Definition 3.2 (Workload Factorization Mechanism)
Given an ε-LDP strategy matrix Q ∈ R^{m×n} and a reconstruction matrix V ∈ R^{p×m} such that W = VQ, the Workload Factorization Mechanism (factorization mechanism for short) is defined as:

M_{V,Q}(x) = V M_Q(x)

Note that M_{V,Q} is defined in terms of M_Q, and it is parameterized by an additional reconstruction matrix V as well. This reconstruction matrix is used to estimate the workload query answers from the response vector output by M_Q. In fact, the workload query estimates produced by this class of mechanisms are unbiased, as:

E[M_{V,Q}(x)] = V E[M_Q(x)] = VQx = Wx.

Furthermore, M_{V,Q} inherits the privacy guarantee of M_Q by the post-processing principle of differential privacy [16].

Remark 1 (Consistency)
Because these mechanisms are unbiased, they will produce the correct workload answers in expectation. However, the individual estimates produced by the mechanism may not be the true workload answers for any underlying dataset, which is a consistency problem. For example, the estimates might suggest that one or more entries of the data vector are negative, which is clearly impossible. To address this problem, we show our mechanism can be improved through a post-processing technique that produces consistent estimates of the workload queries that are as close as possible to the unbiased estimates. This idea is not new, but an adaptation of existing techniques [35, 30], so we defer the full description to the appendix. We nevertheless demonstrate its benefits experimentally in Section 6.7.

We consider this an extension of our mechanism that can improve utility in practice, and evaluate it in isolation. For the remainder of the paper, we will focus on the original, unbiased mechanism, as it is substantially easier to reason about analytically.
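A minimal NumPy sketch of Definition 3.2, using the randomized response strategy with V = W Q^{-1} (a valid factorization since that Q is invertible) and the prefix workload of Example 2.4, verifies the factorization and unbiasedness identities:

```python
import numpy as np

# Workload factorization mechanism, Definition 3.2 (sketch).
eps, n = 1.0, 5
Q = (np.ones((n, n)) + (np.exp(eps) - 1.0) * np.eye(n)) / (np.exp(eps) + n - 1.0)
W = np.tril(np.ones((n, n)))       # prefix workload from Example 2.4
V = W @ np.linalg.inv(Q)           # reconstruction matrix with W = VQ

assert np.allclose(V @ Q, W)       # valid workload factorization

x = np.array([10.0, 20.0, 5.0, 0.0, 0.0])
# E[M_{V,Q}(x)] = V E[M_Q(x)] = VQx = Wx, so the mechanism is unbiased:
assert np.allclose(V @ (Q @ x), W @ x)
```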
Many existing LDP mechanisms can be represented as a factorization mechanism. For example, we show how the Randomized Response mechanism can be expressed as a factorization mechanism in Example 3.3.
Example 3.3 (Randomized Response)
The randomized response mechanism uses Q as defined in Example 2.7 and V = Q^{−1} to estimate the Histogram workload (W = I).

V = 1/(e^ε − 1) ·
    [ e^ε + n − 2   −1            ...  −1
      −1            e^ε + n − 2   ...  −1
      ...           ...           ...  ...
      −1            −1            ...  e^ε + n − 2 ]

While the randomized response mechanism is intended to be used to answer the Histogram workload, there is no reason why it cannot be used for other workloads as well. In fact, it is quite straightforward to see how it can be extended to answer an arbitrary workload, simply by using V = WQ^{−1}.

While the factorization mechanism is unbiased for any workload factorization, different factorizations lead to different amounts of variance on the workload answers. This creates the opportunity to choose the workload factorization that leads to the lowest possible total variance. In order to do that, we need an analytic expression for the total variance in terms of V and Q, which we derive in Theorem 3.4.

Theorem 3.4 (Variance)
The expected total squared error (total variance) of a workload factorization mechanism is:

E[ ||Wx − M_{V,Q}(x)||_2^2 ] = Σ_{u∈U} x_u Σ_{i=1}^{p} [ v_i^T Diag(q_u) v_i − (v_i^T q_u)^2 ]

where q_u denotes column u of Q and v_i^T denotes row i of V.

Notice above that the exact expression for variance depends on the data vector x, which we do not have access to, as it is a private quantity. We want our mechanism to work well for all possible x, so we consider the worst-case variance, and a relaxation, the average-case variance, instead.

Corollary 3.5 (Worst-case variance)
The worst-case variance of M_{V,Q} occurs when all users have the same worst-case type (i.e., x_u = N for some u), and is:

L_worst(V, Q) = N max_{u∈U} Σ_{i=1}^{p} [ v_i^T Diag(q_u) v_i − (v_i^T q_u)^2 ].
The average-case variance of M_{V,Q} occurs when x_u = N/n for all u, and is:

L_avg(V, Q) = (N/n) Σ_{u∈U} Σ_{i=1}^{p} [ v_i^T Diag(q_u) v_i − (v_i^T q_u)^2 ].

With these analytic expressions for variance, we can analyze and compare existing mechanisms that can be expressed as a workload factorization mechanism. The variance for randomized response is shown in Example 3.7.
Example 3.7 (Variance of Randomized Response)
The worst-case and average-case variance of the factorization in Example 3.3 on the Histogram workload is:

L_worst(V, Q) = L_avg(V, Q) = N(n − 1) [ n/(e^ε − 1)^2 + 2/(e^ε − 1) ]

The expression above is obtained by simply plugging V and Q into the equations above and simplifying.² Interestingly, the worst-case and average-case variance are the same for this workload factorization due to the symmetry in the workload and strategy matrices.

²Alternatively, if we had a prior distribution over x, we could use that to estimate variance.

With an analytic expression for variance, we can state the optimization problem underlying the factorization mechanism. Our goal is to find a workload factorization that minimizes the total variance on the workload. To do that, we set up an optimization problem, using total variance as the objective function while taking into consideration the constraints that have to hold on V and Q. This is formalized in Problem 3.8.

Problem 3.8 (Optimal Factorization)
Given a workload W and a privacy budget ε:

minimize_{V,Q}  L(V, Q)
subject to      W = VQ
                Σ_o Q_{o,u} = 1  for all u
                0 ≤ Q_{o,u} ≤ exp(ε) Q_{o,u'}  for all o, u, u'.

Above, L is a loss function that captures how good a given factorization is, such as the worst-case variance L_worst or the average-case variance L_avg. While our original objective in Problem 3.1 was to find the mechanism that optimizes worst-case variance, for practical reasons we use the average-case variance instead. The average-case variance is a smooth approximation of the worst-case variance, which leads to a more analytically and computationally tractable problem. Additionally, the smoothness of the average-case variance makes the corresponding optimization problem more amenable to numerical methods. We study the ramifications of this relaxation theoretically in Section 5.1. When using L_avg as the objective function, we observe it can be expressed in a much simpler form using matrix operations, as shown in Theorem 3.9.

Theorem 3.9 (Objective Function)
The objective function L(V, Q) = tr[V D_Q V^T] is related to L_avg(V, Q) by:

L_avg(V, Q) = (N/n) ( L(V, Q) − ||W||_F^2 ),

where D_Q = Diag(Q1) and tr[·] is the trace of a matrix.

From now on, when we refer to L(V, Q), we are referring to its definition in Theorem 3.9. The new objective is equivalent to L_avg up to constant factors, and hence can be used in place of it for the purposes of optimization. With this simplified objective function, we observe that for a fixed strategy matrix Q, we can compute the optimal V in closed form. If the entries of the response vector were statistically independent and had equal variance, then this would simply be V = WQ^†, where Q^† is the Moore-Penrose pseudo-inverse of Q [11, 23]. However, since the entries of the response vector have unequal variance and are not statistically independent in general, this simple expression is not correct. We can still express the optimal V in closed form, however, as shown in Theorem 3.10.

Theorem 3.10 (Optimal V for fixed Q)
For a fixed Q, the minimizer of L(V, Q) subject to W = VQ is given by:

V = W (Q^T D_Q^{−1} Q)^† Q^T D_Q^{−1}.

Note that D_Q is invertible without loss of generality. If it were not, then one entry of the diagonal would have to be 0, implying that a row of Q is all zero. Such a row corresponds to an output that never occurs under the privacy mechanism, and can be removed without changing the mechanism. Further note that for the above formula to apply, there must exist a V such that W = VQ, which is guaranteed if and only if W is in the row space of Q [37]. Expressed as a constraint, this is W = WQ^†Q.

Now that we know the optimal V for any Q, we can plug it into L(V, Q) to express the objective as a function of Q alone. Doing this, and simplifying further, leads to our final optimization objective, stated in Theorem 3.11.

Theorem 3.11 (Objective Function for Q)
The objective function can be expressed as:

L(Q) = tr[ (Q^T D_Q^{−1} Q)^† (W^T W) ].

L(Q) is our final optimization objective. We are almost ready to restate the optimization problem in terms of Q. However, before we do that, it is useful to simplify the constraints of the problem. The constraints stated in Problem 3.8 are challenging to deal with algorithmically because there are a large number of them. Ignoring the factorization constraint, there are n^2 m + n constraints on Q, and each entry of Q is constrained by entries from the same column and entries from the same row. By introducing an auxiliary optimization variable z ∈ R^m, we reduce this to 2nm + n constraints, so that each entry of Q is only constrained by entries from the same column and z. Specifically, z corresponds to the minimum allowable value on each row of Q, and every column of Q must be between z and exp(ε)z (coordinate-wise inequality). It is clear that this is exactly equivalent to Condition 1 in Proposition 2.6. Also note that Condition 2 can be expressed in matrix form as 1^T Q = 1^T. The final optimization problem underlying the workload factorization mechanism is stated in Problem 3.12.

Problem 3.12 (Strategy Optimization)
Given a workload W and a privacy budget ε:

minimize_{Q,z}  tr[ (Q^T D_Q^{−1} Q)^† (W^T W) ]
subject to      W = WQ^†Q
                1^T Q = 1^T
                0 ≤ z
                z ≤ q_u ≤ exp(ε) z  for all u.
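The closed forms in Theorems 3.10 and 3.11 translate directly into code. The following NumPy sketch (helper names are ours) computes L(Q) and the optimal V for a small randomized response strategy, and checks that W = VQ holds and that the two expressions of the objective, tr[V D_Q V^T] and tr[(Q^T D_Q^{−1} Q)^† W^T W], agree:

```python
import numpy as np

def loss(Q, W):
    # Theorem 3.11: L(Q) = tr[(Q^T D_Q^{-1} Q)^† (W^T W)], D_Q = Diag(Q1).
    D_inv = np.diag(1.0 / (Q @ np.ones(Q.shape[1])))
    A = Q.T @ D_inv @ Q
    return np.trace(np.linalg.pinv(A) @ (W.T @ W))

def optimal_V(Q, W):
    # Theorem 3.10: V = W (Q^T D_Q^{-1} Q)^† Q^T D_Q^{-1}
    # (assumes W lies in the row space of Q).
    D_inv = np.diag(1.0 / (Q @ np.ones(Q.shape[1])))
    A = Q.T @ D_inv @ Q
    return W @ np.linalg.pinv(A) @ Q.T @ D_inv

eps, n = 1.0, 4
Q = (np.ones((n, n)) + (np.exp(eps) - 1.0) * np.eye(n)) / (np.exp(eps) + n - 1.0)
W = np.tril(np.ones((n, n)))       # prefix-style workload
V = optimal_V(Q, W)
D = np.diag(Q @ np.ones(n))

assert np.allclose(V @ Q, W)                          # W = VQ
assert np.isclose(loss(Q, W), np.trace(V @ D @ V.T))  # objectives agree
```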
4. OPTIMIZATION ALGORITHM
We now discuss our approach to solving Problem 3.12. It is a nonlinear optimization problem with linear and nonlinear constraints. While the objective is smooth (and hence differentiable) within the boundary of the constraint W = WQ^†Q, it is not convex. It is typically infeasible to find a closed-form solution to such a problem, and conventional numerical optimization methods are not guaranteed to converge to a global minimum for non-convex objectives. However, such numerical gradient-based methods have seen remarkable empirical success in a variety of domains, often finding high-quality local minima. That is the approach we take; however, rather than use an out-of-the-box commercial solver, which would not be able to scale to larger problem sizes, we provide our own optimization algorithm which achieves greater scalability by exploiting structure in the constraint set.

Algorithm 1 Projection onto bounded probability simplex

Input: r, z ∈ R^m, ε
u, π = sort([z − r, exp(ε)z − r])   ▷ sorted vector and corresponding permutation
a = [ (−1)^{1[π_i > m]} ]_{i=1,...,2m}
b = [ Σ_{j=1}^{i−1} a_j ]_{i=1,...,2m}
ρ = min{ 1 ≤ i ≤ 2m : 1^T z + b_i u_i − Σ_{j=1}^{i−1} a_j u_j > 1 }
λ = (1 − 1^T z − b_ρ u_ρ + Σ_{j=1}^{ρ−1} a_j u_j)/b_ρ + u_ρ
return clip(r + λ, z, exp(ε)z)

The algorithm we propose is an instance of projected gradient descent [36], a variant of gradient descent that handles constraints. To implement this algorithm, the key challenge is to project onto the constraint set. In other words, given a matrix R that does not satisfy the constraints, find the "closest" matrix Q that does satisfy the constraints. Ignoring the constraint W = WQ^†Q for now, this sub-problem is stated formally in Problem 4.1.

Problem 4.1 (Projection onto LDP Constraints)
Given an arbitrary matrix R, a vector z, and a privacy budget ε, the projection onto the privacy constraint, denoted Q = Π_{z,ε}(R), is obtained by solving the following problem:

minimize_Q  ||Q − R||_F
subject to  1^T Q = 1^T
            z ≤ q_u ≤ exp(ε) z  for all u.

Problem 4.1 is easier to solve than Problem 3.12 because the objective is now a quadratic function of Q. In addition, a key insight to solve this problem efficiently is to notice that it is closely related to the problem of projecting onto the probability simplex [43] (now with bound constraints), and admits a similar solution. Specifically, the form of the solution is stated in Proposition 4.2.

Proposition 4.2 (Projection Algorithm)
The solution to Problem 4.1 may be obtained one column at a time using q_u = clip(r_u + λ_u, z, exp(ε)z), where clip "clips" the entries of r_u + λ_u to the range [z, exp(ε)z] entry-wise and λ_u is a scalar value that makes 1^T q_u = 1.

The solution is remarkably simple. Intuitively, we add the same scalar value to every entry of r_u, then clip those values that lie outside the allowed range. The scalar value λ_u is the Lagrange multiplier on the constraint 1^T q_u = 1, and is chosen so that q_u satisfies the constraint. It may be calculated through binary search or any other method to find the root of the function 1^T q_u − 1. We give an O(m log m) algorithm to find λ_u and q_u in Algorithm 1.

Now that we have discussed the projection problem and its solution, we are ready to state the full projected gradient descent algorithm for finding an optimized strategy. Algorithm 2 is an iterative algorithm, where in each iteration we perform a gradient descent plus projection step on the optimization parameters z and Q. The gradient ∇_Q L is easily obtained as L is a function of Q, but the gradient term ∇_z L is less obvious. However, by observing that Q is actually a function of z (from the projection step Π_{z,ε}), we can use the multi-variate chain rule to back-propagate the gradient from Q to z to obtain ∇_z L.

Algorithm 2 Strategy optimization

Input: Workload W ∈ R^{p×n}, privacy budget ε
Initialize Q ∈ R^{m×n}, z ∈ R^m, β ∈ R_+
α = β/(n exp(ε))
for t = 1, ..., T do
    z ← clip(z − α ∇_z L(Q), 0, 1)
    Q ← Π_{z,ε}(Q − β ∇_Q L(Q))
end for
return Q

We do not discuss the details of computing the gradients here, as it can be easily accomplished with automatic differentiation tools [20, 32]. We note that Algorithm 2 handles the constraint W = WQ^†Q "for free," in the sense that we do not need to deal with it explicitly, as long as the step sizes and initialization are chosen appropriately. Specifically, as long as the initial Q satisfies the constraint, and the step sizes are sufficiently small, every subsequent Q in the algorithm will also satisfy the constraint. Intuitively, this is because as we move closer to the boundary of the constraint, the objective function blows up and eventually reaches a point of discontinuity when the constraint is not satisfied. Because we update using the negative gradient, which is a descent direction, we will never approach the boundary of the constraint. We discuss the choice of step size and initialization below. This trick is a very nice way to deal with a constraint that is otherwise challenging to handle. We note that similar ideas have been used to deal with related constraints in prior work [46].

The step size for the gradient descent step must be supplied as input, and two different step sizes are used to update Q and z. Notice that we take a smaller step size to update z than Q.
This is a heuristic we use to make sure z doesn't change too fast; it improves the robustness of the algorithm. We perform a hyper-parameter search to find a step size that works well, only running the algorithm for a few iterations in this phase, then running it longer once a step size is chosen. Decaying the step size at each iteration is also possible, as smaller step sizes typically work better in later iterations.

The final missing piece in Algorithm 2 is the initialization of Q, for which there are multiple options. One option is to initialize with the strategy matrix from an existing mechanism, such as the best one from Table 1. Then intuitively the optimized strategy will never be worse than the other mechanisms, because the negative gradient is a descent direction. This is an informal argument, as it is technically possible that the optimized strategy has better average-case variance but worse worst-case variance than the initial strategy. We do not take this approach, however, as we find initializing Q randomly tends to work better. Specifically, we let R be a random 4n × n matrix, where each entry is sampled from U[0, 1], and we initialize Q by projecting onto the constraint set; i.e., Q = Π_{z,ε}(R), where z = (e^{−ε}/n)1 and 1 is a vector of ones. The choice of m is also an important consideration when initializing Q. While larger m leads to a more expressive strategy space, it also leads to more expensive optimization. Our choice of m = 4n represents a sweet spot that we found works well empirically across a variety of workloads. In general, a hyper-parameter search can be executed to find the best m.
This hyper-parameter search does not degrade the privacy guarantee in any way, because we can evaluate the quality of a strategy without consuming the privacy budget, by using the analytic formulas for variance. It requires O(n^2 m + n^3) time to evaluate the objective function and its gradient (assuming W^T W has been pre-computed) and O(nm log m) time to perform the projection. Thus, the per-iteration time complexity of Algorithm 2 is O(n^2 m + n^3 + nm log m), or O(n^3) when using m = 4n.
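The projection of Proposition 4.2 can also be computed with a simple bisection on λ_u, in place of the sort-based Algorithm 1. The following NumPy sketch (our own simplified variant, not the paper's O(m log m) routine) projects a single column onto the bounded simplex:

```python
import numpy as np

def project_column(r, z, eps, iters=80):
    # Find lambda such that sum(clip(r + lambda, z, exp(eps)*z)) = 1;
    # the clipped sum is monotone nondecreasing in lambda, so bisect.
    lo = np.min(z - r) - 1.0                 # sum = 1^T z here (all at lower bound)
    hi = np.max(np.exp(eps) * z - r) + 1.0   # sum = exp(eps) 1^T z here
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if np.clip(r + lam, z, np.exp(eps) * z).sum() < 1.0:
            lo = lam
        else:
            hi = lam
    return np.clip(r + 0.5 * (lo + hi), z, np.exp(eps) * z)

m, eps = 8, 1.0
z = np.full(m, 0.5 / m)                      # feasible: 1^T z <= 1 <= exp(eps) 1^T z
q = project_column(np.random.default_rng(0).uniform(size=m), z, eps)

assert abs(q.sum() - 1.0) < 1e-6             # lands on the simplex
assert (q >= z - 1e-9).all() and (q <= np.exp(eps) * z + 1e-9).all()
```

Inside Algorithm 2, this projection would be applied to each column of Q − β∇_Q L(Q) after the gradient step; the sort-based Algorithm 1 computes the same λ_u without iteration.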
5. THEORETICAL RESULTS
In this section, we answer several theoretical questions about our mechanism. First, we justify the relaxation in the objective function, used to make the optimization analytically tractable. Second, we theoretically analyze the error achieved by our mechanism, measured in terms of sample complexity. Third, we derive lower bounds on the achievable error of workload factorization mechanisms. All the proofs are deferred to Appendix B.
In Section 3 we replaced our true optimization objective $L_{worst}$ with a relaxation $L_{avg}$. In this section, we justify that choice theoretically, showing that $L_{worst}$ is tightly bounded above and below by $L_{avg}$.

Theorem 5.1 (Bounds on $L_{worst}$). Let $W = VQ$ be an arbitrary factorization of $W$ where $Q$ is an $\epsilon$-LDP strategy matrix. Then the worst-case variance $L_{worst}(V, Q)$ and average-case variance $L_{avg}(V, Q)$ are related as follows:

$$L_{avg}(V, Q) \le L_{worst}(V, Q) \le e^{\epsilon} \left( L_{avg}(V, Q) + \frac{N}{n} \|W\|_F^2 \right)$$

Theorem 5.1 suggests that relaxing $L_{worst}$ to $L_{avg}$ does not significantly impact the optimization problem. Intuitively, this theorem holds because of the privacy constraint on $Q$, which guarantees that the column of $Q$ for the worst-case user cannot be too different from any other column. Hence, all users must have a similar impact on the total variance of the mechanism. Empirically, we find that $L_{worst}$ is often even closer to $L_{avg}$ than the upper bound suggests. Furthermore, in some cases $L_{worst}$ is exactly equal to $L_{avg}$, as we showed in Example 3.7.

We gave an analytic expression for the expected total squared error (total variance) of our mechanism in Corollary 3.5. However, this quantity might be difficult to interpret, and it is more natural to look at the number of samples needed to achieve a fixed error instead. Furthermore, when running an LDP mechanism it is important to know how much data is required to obtain a target error rate, as that information is critical for determining an appropriate privacy budget. Because the total variance increases with the number of individuals $N$ and the number of workload queries $m$, we instead look at a normalized measure of variance.

Definition 5.2 (Normalized Variance).
The normalized worst-case variance of $M_{V,Q}$ is:

$$L_{norm}(V, Q) = \max_{x} \mathbb{E}\left[ \frac{1}{m} \left\| \frac{1}{N} \left( Wx - M_{V,Q}(x) \right) \right\|^2 \right]$$

$L_{norm}$ is the same as $L_{worst}$ up to constant factors, although it is more interpretable because it is a measure of variance on a single "average" workload query, where variance is measured on the normalized data vector $x/N$.

Corollary 5.3 (Normalized Variance).
The normalized variance is:

$$L_{norm}(V, Q) = \frac{1}{mN^2} L_{worst}(V, Q) = \frac{1}{mN} \max_{u \in \mathcal{U}} \sum_{i=1}^{p} \left[ v_i^T \mathrm{Diag}(q_u) v_i - (v_i^T q_u)^2 \right]$$

Interestingly, the dependence on $N$ does not change with $V$ and $Q$: it is always $\Theta(1/N)$, but the constant factor depends on the quality of the workload factorization.

Corollary 5.4 (Sample Complexity).
The number of samples needed to achieve normalized variance $\alpha$ is:

$$N \ge \frac{1}{m\alpha} \max_{u \in \mathcal{U}} \sum_{i=1}^{p} \left[ v_i^T \mathrm{Diag}(q_u) v_i - (v_i^T q_u)^2 \right]$$

We can readily compute the sample complexity numerically for any factorization $VQ$. In fact, the sample complexity and worst-case variance of a mechanism are proportional, as evident from the above equation. Additionally, by replacing $L_{worst}(V, Q)$ with a lower bound, we can get an analytical expression for the sample complexity in terms of the privacy budget $\epsilon$ and the properties of the workload $W$.

Example 5.5 (
Sample complexity, Randomized Response)
The Randomized Response mechanism described in Example 3.3 has sample complexity:

$$N \ge \frac{(n-1)(n - 2 + 2e^{\epsilon})}{\alpha n (e^{\epsilon} - 1)^2}$$

on the Histogram workload.

Example 5.5 suggests the sample complexity of the randomized response mechanism grows roughly at a linear rate with the domain size $n$.

For a given workload, a theoretical lower bound on the achievable error is useful for checking how close to optimal our strategies are. It also can be used to characterize the inherent difficulty of the workload. In this section, we derive an easily-computable lower bound on the achievable error under our mechanism in terms of the singular values of the workload matrix.
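Corollary 5.4 makes the sample complexity directly computable for any factorization $VQ$; a short sketch follows (the function name is ours). As a cross-check, for randomized response on the Histogram workload the numeric value matches a closed form we derived for that special case, which grows linearly in $n$, consistent with Example 5.5.

```python
import numpy as np

def sample_complexity(V, Q, alpha):
    """Smallest N achieving normalized variance alpha (Corollary 5.4),
    taking m = number of workload queries = number of rows of V."""
    # per-user variance: sum_i [ v_i^T Diag(q_u) v_i - (v_i^T q_u)^2 ]
    per_user = ((V**2) @ Q).sum(axis=0) - ((V @ Q)**2).sum(axis=0)
    return per_user.max() / (V.shape[0] * alpha)

# randomized response factorization of the Histogram workload (W = I)
n, eps, alpha = 16, 1.0, 0.01
c = np.exp(eps) + n - 1
Q = (np.eye(n) * (np.exp(eps) - 1) + 1) / c
N_required = sample_complexity(np.linalg.inv(Q), Q, alpha)

# our closed form for this special case (an assumption, derived by hand):
t = np.exp(eps)
closed_form = (n - 1) * (n - 2 + 2 * t) / (alpha * n * (t - 1) ** 2)
```

Here `N_required` and `closed_form` agree, and both scale linearly in $n$ for fixed $\epsilon$ and $\alpha$.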
Theorem 5.6 (
Lower Bound, Factorization Mechanism)
Let $W$ be a workload matrix and let $Q$ be any $\epsilon$-LDP strategy matrix. Then we have:

$$\frac{1}{\exp(\epsilon)} (\lambda_1 + \cdots + \lambda_n)^2 \le L(Q)$$

where $\lambda_1, \ldots, \lambda_n$ are the singular values of $W$ and $L(Q)$ is the loss function defined in Theorem 3.11.

This result is similar to lower bounds known to hold in the central model of differential privacy, based on the analysis of the Matrix Mechanism [29]. In both cases the hardness of a workload is characterized by its singular values. Other lower bounds for this problem have characterized the hardness of a workload in terms of quantities like the largest $L_2$ column norm of $W$ [4], the so-called factorization norm of $W$ [17], and the so-called packing number associated with $W$ [9]. While interesting theoretically, the factorization norm and packing number are hard to calculate in practice. In contrast, our bound can be easily calculated.

Theorem 5.6 gives a lower bound on our optimization objective. Translating that back to worst-case variance gives us Corollary 5.7.

Corollary 5.7 (Worst-case variance).
The worst-case variance of any factorization mechanism must be at least:

$$\frac{N}{n \exp(\epsilon)} (\lambda_1 + \cdots + \lambda_n)^2 - \frac{N}{n} \|W\|_F^2 \le L_{worst}(V, Q)$$

Combining Corollary 5.7 with Corollary 5.4 and applying it to the Histogram workload gives us a lower bound on the sample complexity.

Example 5.8 (Lower Bound for Histogram Workload).
Every workload factorization mechanism requires at least

$$\frac{1}{\alpha} \left( \frac{1}{\exp(\epsilon)} - \frac{1}{n} \right)$$

samples to achieve normalized variance $\alpha$ on the Histogram workload.

Note the very weak dependence on $n$ in Example 5.8, which suggests that the sample complexity should not change much with $n$. Further, recall from Example 5.5 that the sample complexity of randomized response is linear in $n$. This suggests randomized response is not the best mechanism for the Histogram workload. This result is not new, as there are several mechanisms that are known to perform better than randomized response [1, 45, 41, 18]. We show empirically in Section 6 that some of these mechanisms achieve the optimal sample complexity for the Histogram workload up to constant factors (i.e., no dependence on $n$). Our mechanism also achieves the optimal sample complexity for this workload, but has better constant factors.

For other workloads, the sample complexity may depend on $n$. Calculating the exact dependence on $n$ for other workloads requires deriving the singular values of the workload as a function of $n$ in closed form, which may be challenging for workloads with complicated structure.
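Both the lower bound of Theorem 5.6 and the objective $L(Q)$ of Theorem 3.11 are directly computable, so checking how far a given strategy is from optimal is straightforward; a sketch (helper names ours):

```python
import numpy as np

def svd_lower_bound(W, eps):
    """Theorem 5.6: (1/e^eps) * (sum of singular values of W)^2 <= L(Q)."""
    sv = np.linalg.svd(W, compute_uv=False)
    return sv.sum() ** 2 / np.exp(eps)

def L_of_Q(W, Q):
    """Objective L(Q) = tr[(Q^T D_Q^{-1} Q)^+ (W^T W)], D_Q = Diag(Q 1)."""
    D_inv = np.diag(1.0 / Q.sum(axis=1))
    return np.trace(np.linalg.pinv(Q.T @ D_inv @ Q) @ (W.T @ W))

n, eps = 8, 1.0
W = np.eye(n)                                   # Histogram workload
c = np.exp(eps) + n - 1
Q = (np.eye(n) * (np.exp(eps) - 1) + 1) / c     # randomized response
bound, loss = svd_lower_bound(W, eps), L_of_Q(W, Q)
```

For the Histogram workload all singular values are 1, so the bound is $n^2 / e^{\epsilon}$, and the gap between `bound` and `loss` quantifies how far randomized response is from the best achievable objective.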
6. EXPERIMENTS
In this section we experimentally evaluate our mechanism. We extensively study the utility of our mechanism on a variety of workloads, domains, and privacy levels, and compare it against multiple competing mechanisms from the literature. We demonstrate consistent improvements in utility compared to other mechanisms in all settings (Sections 6.2, 6.3, and 6.4). We also study the robustness and scalability of our optimization algorithm (Sections 6.5 and 6.6). The source code for our mechanism, and other mechanisms represented as a strategy matrix, is available at https://github.com/ryan112358/workload-factorization-mechanism.

Workloads.
We consider six different workloads in our empirical analysis, each of which can be defined for a specified domain size. These workloads are intended to capture common queries an analyst might want to perform on data and have been studied previously in the privacy literature. These workloads are Histogram, Prefix, All Range, All Marginals, 3-Way Marginals, and Parity.
Figure 1: Sample complexity of 7 algorithms on 6 workloads for $\epsilon \in [0.5, 4.0]$.

Mechanisms.
We compare our mechanism against six other state-of-the-art mechanisms, including Randomized Response [44], Hadamard [2], Hierarchical [13, 42], Fourier [12], and the Matrix Mechanism [27, 17] (both $L_1$ and $L_2$ versions). While the Matrix Mechanism is typically thought of as a technique for central differential privacy, it has been studied theoretically as a mechanism for local differential privacy as well [17]. This version of the "distributed" Matrix Mechanism is what we compare against in experiments.

The first four mechanisms are all particular instances of the class of factorization mechanisms, just with different factorizations. They were all designed to answer a fixed workload (e.g., Randomized Response was designed for the Histogram workload), but they can still be run on other workloads with minor modifications. In particular, for each mechanism we use the same $Q$ across different workloads, but change $V$ based on the workload, using Theorem 3.10. We omit from comparison the Gaussian mechanism [4], as it is strictly dominated by the $L_2$ Matrix Mechanism. We also omit from comparison RAPPOR [18] and Subset Selection [45], as they require exponential space to represent the strategy matrix, making it prohibitive to calculate worst-case variance and sample complexity. However, we note that these mechanisms have been previously compared with Hadamard, and shown to offer comparable performance on the Histogram workload [2].
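Since each of these baselines is represented by a strategy matrix, the variance quantities used throughout this section can be computed directly from the factorization $(V, Q)$; a sketch with our own helper names, illustrated on randomized response, where by symmetry the worst-case and average-case variance coincide (consistent with the bounds in Theorem 5.1):

```python
import numpy as np

def per_user_variance(V, Q):
    # entry u: sum_i [ v_i^T Diag(q_u) v_i - (v_i^T q_u)^2 ]
    return ((V**2) @ Q).sum(axis=0) - ((V @ Q)**2).sum(axis=0)

def L_worst(V, Q, N):
    return N * per_user_variance(V, Q).max()

def L_avg(V, Q, N):
    return (N / Q.shape[1]) * per_user_variance(V, Q).sum()

# randomized response factorization of the Histogram workload (W = I)
n, eps, N = 8, 1.0, 1000
c = np.exp(eps) + n - 1
Q = (np.eye(n) * (np.exp(eps) - 1) + 1) / c
V = np.linalg.inv(Q)          # V Q = I = W
W = np.eye(n)

lo, hi = L_avg(V, Q, N), L_worst(V, Q, N)
upper = np.exp(eps) * (lo + (N / n) * np.linalg.norm(W, 'fro')**2)
```

Here `lo == hi` because every user contributes the same variance, and both sit below the Theorem 5.1 upper bound `upper`.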
Evaluation.
Our primary evaluation metric for comparing algorithms is sample complexity, which we calculate exactly using Corollary 5.4 with $\alpha = 0.01$. Recall that the sample complexity is proportional to the worst-case variance, but is appropriately normalized and easier to interpret. Furthermore, we remark that for most experiments in this section, no input data is required, as the sample complexities we report apply for the worst-case dataset. In practice, we have found that the variance on real-world datasets is quite close to the worst-case variance, as we demonstrate in Section 6.4. We also vary the privacy budget $\epsilon$ and domain size $n$, studying their impact on the sample complexity for each mechanism and workload.

Figure 1 shows the relationship between workload and $\epsilon$ on the sample complexity for each mechanism. We consider $\epsilon$ ranging from 0.5 to 4.0, fixing $n$ to be 512. These privacy budgets are common in practical deployments of differential privacy, and local differential privacy in particular [41, 14, 18]. We state our main findings below:

• Our mechanism (labeled Optimized) is consistently the best in all settings: it requires fewer samples than every other mechanism on every workload and $\epsilon$ we tested.

• The magnitude of improvement over the best competitor varies between a factor of just over 1 (at $\epsilon = 0.5$) and roughly 14 (at $\epsilon = 4.0$). In the high-privacy regime with $\epsilon \ll 0.5$, our mechanism is typically quite close to the best competitor, and in the very low-privacy regime with $\epsilon \gg 4.0$, our mechanism matches randomized response, which is optimal in that regime. The reduction of required samples translates to a context that really matters: data collectors can now run their analyses on smaller samples to achieve their desired accuracy.
Figure 2: Sample complexity of 7 algorithms on 6 workloads for $n \in [8, 1024]$.

• The best competitor changes with the workload and $\epsilon$. For example, the best competitor on the Prefix workload was Hierarchical, while the best competitor on the 3-Way Marginals workload was Fourier. In both cases, these mechanisms were specifically designed to offer low error on their respective workloads, but they don't work as well on other workloads. Additionally, Randomized Response is often the best mechanism at high $\epsilon$, so even for a fixed workload, the best competitor is not always the same. On the other hand, our mechanism adapts effectively to the workload and $\epsilon$, and works well in all settings. As a result, only one algorithm needs to be implemented, rather than an entire library of algorithms, and, accordingly, it is not necessary to select among alternative algorithms.

• Some workloads are inherently harder to answer than others: the number of samples required by our mechanism differs by up to two orders of magnitude between workloads. The easiest workload appears to be Histogram, while the hardest is Parity. This is consistent with the lower bound we gave in Theorem 5.6, which characterizes the hardness of the workload in terms of its singular values; the bound is much lower for Histogram than for Parity.
Figure 2 shows the relationship between workload and $n$ on the sample complexity for each mechanism. We consider $n$ ranging from 8 to 1024, fixing $\epsilon$ to be 1.0. We state our main findings below:

• For the Histogram workload, there is almost no dependence on the domain size for every mechanism except randomized response. This is consistent with our finding in Example 5.8 regarding the lower bound on sample complexity. This observation is unique to the Histogram workload, however.

• The mechanisms that were designed for a given workload, and those that adapt to the workload, have a better dependence on the domain size (smaller slope) than the mechanisms that do not. This includes the $L_2$ Matrix Mechanism, which is worse than every other mechanism in most settings, but slowly overtakes the other mechanisms for large domain sizes.

• The sample complexity of our mechanism and other mechanisms tailored to the workload is generally $O(\sqrt{n})$, as the slope of the lines is $\approx 0.5$ (the slope of a line in log space corresponds to the power of a polynomial in linear space; i.e., $\log y = m \log x \rightarrow y = x^m$). On the other hand, the sample complexity of the mechanisms not tailored to the workload is more like $O(n)$, as the slope of the line is $\approx 1$.

Whereas results in previous sections focused on worst-case sample complexity, we now turn our attention to sample complexity on real-world benchmark datasets obtained from the DPBench study [22]. To calculate the sample complexity on real data, we simply replace $L_{worst}$ in Corollary 5.4 with the exact (data-dependent) expression for total variance stated in Theorem 3.4.

In Figure 3a, we plot the sample complexity of each mechanism on three datasets (HEPTH, MEDCOST, and NETTRACE) for the Prefix workload, fixing $n = 512$ and $\epsilon = 1.0$. We also plot the worst-case sample complexity for reference. As expected, our mechanism still outperforms all others on each dataset. In fact, all mechanisms performed pretty consistently, offering similar sample complexities for each dataset. The largest deviation between datasets occurs for the Hadamard mechanism, which needs a factor of between one and two more samples for the NETTRACE dataset than for the HEPTH dataset. The Optimized mechanism is even more consistent, with an even smaller maximum deviation.

Figure 3: (a) Sample complexity on benchmark datasets for Prefix workload. (b) Worst-case variance (ratio to best found) of optimized strategy for various $m$. (c) Per-iteration time complexity of optimization for increasing domain sizes.

Additionally, the real-world sample complexity is very well-approximated by the worst-case sample complexity for the Optimized mechanism, as the maximum deviation is small. This suggests the conclusions drawn based on worst-case sample complexity in Figure 1 and Figure 2 hold for real-world data as well. Although not shown, we repeated this experiment for other workloads and settings and made similar observations.

Recall that our optimization algorithm is initialized with a random strategy matrix, and that different initial strategies can lead to different optimized strategies. In this section, we aim to understand how sensitive our optimization algorithm is to the different initializations, and whether it depends on $m$, the number of rows in the strategy matrix. We fix $n = 64$ and $\epsilon = 1.0$, and vary $m$ from $2n$ to $16n$. For each $m$, we compute 10 optimized strategies with different random initializations and record the worst-case variance for each strategy. In order to plot all workloads on the same figure, we normalize the worst-case variance to the best found across all trials. In Figure 3b, we plot the median variance ratio for each $m$ as well as an error bar to indicate the min and max ratio obtained across the 10 trials. We observe that the optimization is quite robust to the initialization, and produces pretty consistent results between runs, as evident by the small error bars. Furthermore, the optimization is not very sensitive to the choice of $m$, as all optimized strategies are within a factor of 1.21 of the best found. Strategies tend to get closer to optimal for larger $m$, and eventually level off, with the exception of the Parity workload. We suspect this difference is due to the fact that Parity is a low-rank workload, and doesn't require a large strategy. Using $m = 4n$ as we did in other experiments tends to produce strategies within a factor of 1.05 to 1.1 of the best found, though it may still be worthwhile to tune $m$, as this extra 10% improvement is meaningful in practice.

We measure the scalability of optimization by looking at the per-iteration time complexity. In each iteration, we must evaluate the objective function (and its gradient) stated in Theorem 3.11, then project onto the constraint set using Algorithm 1. We assume $W^T W$ has been precomputed, and note that the per-iteration time complexity only depends on $W^T W$ through its size, and not its contents. We therefore use the $n \times n$ identity matrix for $W$. Additionally, we let $Q$ be a random $4n \times n$ strategy matrix. In Figure 3c, we report the per-iteration time required for increasing domain sizes, averaged over 15 iterations. As we can see, optimization scales up to domains as large as $n = 4096$, where it takes about 139 seconds per iteration. While expensive, it is not unreasonable to run for a few hundred iterations in this case, and is an impressive scalability result given that there are over 67 million optimization variables when $n$ is that large. Additionally, we note that strategy optimization is a one-time cost, and it can be done offline before deploying the mechanism to the users. Furthermore, as we showed in Section 6.3, the number of samples required typically increases with the domain size, so there is good reason to run mechanisms on small domains whenever possible, compressing the domain if necessary. In general, the plot shows that the time grows roughly at an $O(n^3)$ rate, as it took about 19 seconds for $n = 2048$ and between 2 and 3 seconds for $n = 1024$, confirming the theoretical time complexity analysis.

We now experimentally evaluate the extension we proposed in Remark 1, which we call workload non-negative least squares (WNNLS). The full details are described in Appendix A. For this experiment, we fix $\epsilon = 1.0$ and $n = 512$ with a fixed number of users $N$, and use a random sample of data from the "HEPTH" dataset obtained from the DPBench study [22], but note that results on other datasets were similar.
With this extension, we no longer have a closed-form expression for variance, so we run 100 simulations to estimate it instead. Figure 4 shows the (normalized) variance of the mechanism on this dataset with and without the extension. As we can see, the extension reduces the variance in all cases, and the improvement ranges from 1.96× to 5.6×, which is a significant amount. In general, the magnitude of improvement depends on factors like $\epsilon$ and $N$, which are not varied here. When $N$ and $\epsilon$ are sufficiently large, the default workload query estimates will already be non-negative, in which case WNNLS would offer no improvement. WNNLS offers significant improvement for many practical $\epsilon$ and $N$, however. Additionally, we note that this extension can be plugged into any of the other competing mechanisms as well, and offers similar utility improvements.
Figure 4: Variance of the optimized mechanism with and without the WNNLS extension.
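The WNNLS extension evaluated above reduces to one small constrained least-squares solve per release (full details in Appendix A). A minimal sketch using SciPy's limited-memory BFGS, the solver the appendix mentions; the function name is ours:

```python
import numpy as np
from scipy.optimize import minimize

def wnnls(W, target):
    """Solve min_{x >= 0} ||W x - target||^2 with L-BFGS-B box
    constraints, where target = V y is the unbiased workload estimate."""
    n = W.shape[1]
    f = lambda x: np.sum((W @ x - target) ** 2)
    g = lambda x: 2 * W.T @ (W @ x - target)
    res = minimize(f, np.zeros(n), jac=g, method='L-BFGS-B',
                   bounds=[(0, None)] * n)
    return res.x
```

For instance, with `W` the identity and a noisy estimate `[3, -1, 2]`, the solution clips the negative coordinate: `wnnls(np.eye(3), np.array([3., -1., 2.]))` ≈ `[3, 0, 2]`.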
7. RELATED WORK
The mechanism we propose in this work is related to a large body of research in both the central and local models of differential privacy.

Answering linear queries under central differential privacy is a widely studied topic [3, 30, 33, 8, 28]. Many state-of-the-art mechanisms for this task, like the Matrix Mechanism [27], achieve privacy by adding Laplace or Gaussian noise to a carefully selected set of "strategy queries". This query strategy is tailored to the workload, and can even be optimized for it, as the Matrix Mechanism does. The optimization problem posed by the Matrix Mechanism has been studied theoretically [27, 29], and several algorithms have been proposed to solve it or approximately solve it [47, 28, 46, 33]. While similar in spirit to our mechanism, the optimization problem underlying the Matrix Mechanism is substantially different from ours, as it requires search over a different space of mechanisms tailored to central differential privacy. For both optimization problems, the optimization variable is a so-called "strategy matrix", but these represent fundamentally different things in each mechanism, and hence the constraints on the strategy matrix differ. For the Matrix Mechanism, the strategy matrix encodes a set of linear queries which will be answered with a noise-addition mechanism. In contrast, the strategy matrix for our mechanism encodes a conditional probability distribution.

Answering linear queries under local differential privacy has received less attention, but one notable idea is to directly apply mechanisms from central differential privacy to the local model. This translation can be achieved by simply executing the mechanism independently for each single-user database, then aggregating the results. This approach has been studied theoretically with the Gaussian mechanism [4] and the Matrix Mechanism [17]. While these mechanisms trivially provide privacy, they tend to have poor utility in practice, as they are not tailored to the local model.
Another notable approach for this task casts it as a mean estimation problem, and uses LDP mechanisms designed for that [9]. These works provide a thorough theoretical treatment of this problem, showing bounds on achieved error, but no practical implementation or evaluation.

More work has been done to answer specific, fixed workloads of general interest, such as histograms [1, 45, 41, 44, 18, 25, 6, 40], range queries [13, 42], and marginals [12, 42]. A very nice summary of the computational complexity, sample complexity, and communication complexity of the mechanisms designed for the Histogram workload is given in [1]. Interestingly, even for the very simple Histogram workload, there are multiple mechanisms because the optimal mechanism is not clear. This is in stark contrast to the central model of differential privacy, where, for the Histogram workload, it is clear that the optimal strategy is the workload itself. Almost all of these mechanisms are instances of the general class of mechanisms we consider in Definition 3.2, just with different workload factorizations. The strategy matrices for these workloads were all carefully designed to offer low error on the workloads they were designed for, by exploiting knowledge about those specific workloads. However, none of these mechanisms perform optimization to choose the strategy matrix; instead it is fixed in advance.

Kairouz et al. propose an optimization-based approach to mechanism design as well [26]. The mechanism they propose is not designed to estimate histograms or workload queries, but for other statistical tasks, namely hypothesis testing and something they call information preservation.
They also consider the class of mechanisms characterized by a strategy matrix (Proposition 2.6), and propose an optimization problem over this space to maximize utility for a given task. Moreover, for convex and sublinear utility functions, they show that the optimal mechanism is a so-called extremal mechanism, and state a linear program to find this optimal mechanism. Unfortunately, there are $2^n$ optimization variables in this linear program, making it infeasible to solve in practice. Furthermore, the restriction on the utility function (sublinear, convex) prevents the technique from applying to our setting.
8. CONCLUSION
We proposed a new LDP mechanism that adapts to a workload of linear queries provided by an analyst. We formulated this as a constrained optimization problem over an expressive class of unbiased LDP mechanisms and proposed a projected gradient descent algorithm to solve this problem. We showed experimentally that our mechanism outperforms all other existing LDP mechanisms in a variety of settings, even outperforming mechanisms on the workloads for which they were intended.
Acknowledgements
We would like to thank Daniel Sheldon for his insights regarding Theorem 3.4. This work was supported by the National Science Foundation under grants CNS-1409143, CCF-1642658, CCF-1618512, TRIPODS-1934846; and by DARPA and SPAWAR under contract N66001-15-C-4067.
APPENDIX
Symbol: Meaning
$n$: domain size
$N$: number of users
$\epsilon$: privacy budget
$M$: privacy mechanism
$\mathcal{U}$: set of possible users, $|\mathcal{U}| = n$
$\mathcal{O}$: set of possible outcomes, $|\mathcal{O}| = m$
$W$: $p \times n$ workload matrix
$Q$: $m \times n$ strategy matrix
$V$: $p \times m$ reconstruction matrix
$x$: data vector
$y$: response vector, $y = M_Q(x)$
$\mathbf{1}$: vector of ones
$D_Q$: $\mathrm{Diag}(Q\mathbf{1})$
$Q^{\dagger}$: pseudo-inverse
$q_u$: $u$th column of $Q$
$v_i$: $i$th row of $V$
$\mathrm{tr}[\cdot]$: trace of a matrix
$L_{worst}(V, Q)$: worst-case variance
$L_{avg}(V, Q)$: average-case variance
$L(V, Q)$: optimization objective
$L(Q)$: optimization objective for $Q$
$\mathrm{Diag}(\cdot)$: diagonal matrix from vector

Table 2: Notation

A. NON-NEGATIVITY & CONSISTENCY
In this section, we describe and evaluate a simple extension to our mechanism that can greatly improve its utility in practice. While our basic mechanism offers unbiased answers to the workload queries, this can come at the cost of certain anomalies. For example, the estimated answer to a workload query might be negative, even though the true answer could never be negative. This motivates us to consider an extension where we try to account for the structural constraints we know about the data vector. Specifically, we propose the following non-negative least squares problem to find a non-negative (feasible) $\hat{x}$ such that $W\hat{x}$ is as close as possible to our unbiased estimate $Vy$. The problem at hand is:

$$\hat{x} = \arg\min_{x \ge 0} \|Wx - Vy\|^2$$

Once $\hat{x}$ is obtained, the workload answers can be estimated by computing $W\hat{x}$. While this estimate is not necessarily unbiased, it often has substantially lower variance than $Vy$ and can be a worthwhile trade-off, particularly in the high-privacy/low-data regime where non-negativity is a bigger issue. A similar problem was studied theoretically by Nikolov et al. [35] and empirically by Li et al. [30] in the central model of differential privacy, where it has been shown to offer significant utility improvements. There are various python implementations to solve the above problem efficiently [39, 48, 34]. We simply use the limited-memory BFGS algorithm from scipy to solve it [31, 39].

B. MISSING PROOFS
Proof of Theorem 3.4.
We begin by deriving the variance for a single query $v^T y$ (where $y = M_Q(x)$). Note that $y$ is a sum of multinomial random variables $s_u$ instantiated with parameters $n = x_u$ and $p = q_u$, where $s_u$ is the response vector for all users of type $u$. Using the well-known formula for the covariance of a multinomial random variable, the covariance of a sum, and the variance of a linear combination of correlated random variables, we obtain:

$$\mathrm{Cov}[s_u] = x_u \left( \mathrm{Diag}(q_u) - q_u q_u^T \right)$$
$$\mathrm{Cov}[y] = \sum_u x_u \left( \mathrm{Diag}(q_u) - q_u q_u^T \right)$$
$$\mathrm{Var}[v^T y] = v^T \mathrm{Cov}[y] v = \sum_u x_u \left( v^T \mathrm{Diag}(q_u) v - v^T q_u q_u^T v \right) = \sum_u x_u \left( v^T \mathrm{Diag}(q_u) v - (v^T q_u)^2 \right)$$

The total variance is obtained by summing over all the rows of $V$. This completes the proof.

Proof of Theorem 3.9.
We prove the claim by showing the two objectives are the same up to constant additive and multiplicative factors.

$$L_{avg}(V, Q) \propto \sum_{u \in \mathcal{U}} \sum_{i=1}^{p} v_i^T \mathrm{Diag}(q_u) v_i - (v_i^T q_u)^2 = \sum_{u \in \mathcal{U}} \sum_{i=1}^{p} v_i^T \mathrm{Diag}(q_u) v_i - \|VQ\|_F^2 = \sum_{i=1}^{p} v_i^T \mathrm{Diag}\Big( \sum_u q_u \Big) v_i - \|W\|_F^2 = \mathrm{tr}[V D_Q V^T] - \|W\|_F^2$$

Since $\|W\|_F^2$ is constant, we can drop that term for the purposes of defining the objective function. This completes the proof.

Proof of Theorem 3.10.
Observe that we can construct $V$ one row at a time because there are no interaction terms between $v_i$ and $v_j$ for $i \ne j$ in the objective function. Furthermore, we can optimize $v_i$ through the following quadratic program:

$$\text{minimize}_{v_i} \quad v_i^T D_Q v_i \quad \text{subject to} \quad Q^T v_i = w_i$$

The above problem is closely related to a standard norm-minimization problem, and can be transformed into one by making the substitution $u_i = D_Q^{1/2} v_i$. The problem becomes:

$$\text{minimize}_{u_i} \quad \|u_i\|^2 \quad \text{subject to} \quad (Q^T D_Q^{-1/2}) u_i = w_i$$

The unique solution to this problem is given by $u_i = (Q^T D_Q^{-1/2})^{\dagger} w_i$ [10]. Using the Hermitian reduction identity [7] $X^{\dagger} = X^T (X X^T)^{\dagger}$, we have:

$$v_i = D_Q^{-1/2} u_i = D_Q^{-1/2} (Q^T D_Q^{-1/2})^{\dagger} w_i = D_Q^{-1/2} D_Q^{-1/2} Q (Q^T D_Q^{-1/2} D_Q^{-1/2} Q)^{\dagger} w_i = D_Q^{-1} Q (Q^T D_Q^{-1} Q)^{\dagger} w_i$$

Stacking the rows $v_i$, we arrive at the desired solution:

$$V = W (Q^T D_Q^{-1} Q)^{\dagger} Q^T D_Q^{-1}$$

Proof of Theorem 3.11.
We plug in the optimal solution for $V$ as given in Theorem 3.10 and simplify using linear algebra identities and the cyclic permutation property of trace:

$$L(Q) \triangleq \min_V L(V, Q) = \mathrm{tr}[W (Q^T D_Q^{-1} Q)^{\dagger} Q^T D_Q^{-1} D_Q D_Q^{-1} Q (Q^T D_Q^{-1} Q)^{\dagger} W^T] = \mathrm{tr}[W (Q^T D_Q^{-1} Q)^{\dagger} (Q^T D_Q^{-1} Q) (Q^T D_Q^{-1} Q)^{\dagger} W^T] = \mathrm{tr}[W (Q^T D_Q^{-1} Q)^{\dagger} W^T] = \mathrm{tr}[(Q^T D_Q^{-1} Q)^{\dagger} (W^T W)]$$

Proof of Theorem 5.1.
It is obvious that the worst-case variance is greater than (or equal to) the average-case variance. We will now bound the worst-case variance from above. Using $u^*$ to denote the worst-case user, we have the following upper bound on worst-case variance:

$$L_{worst}(V, Q) = N \max_{u \in \mathcal{U}} \sum_{i=1}^{p} v_i^T \mathrm{Diag}(q_u) v_i - (v_i^T q_u)^2$$
$$\stackrel{(a)}{\le} N \sum_{i=1}^{p} v_i^T \mathrm{Diag}(q_{u^*}) v_i = \frac{N}{n} \sum_{u \in \mathcal{U}} \sum_{i=1}^{p} v_i^T \mathrm{Diag}(q_{u^*}) v_i$$
$$\stackrel{(b)}{\le} e^{\epsilon} \frac{N}{n} \sum_{u \in \mathcal{U}} \sum_{i=1}^{p} v_i^T \mathrm{Diag}(q_u) v_i$$
$$\stackrel{(c)}{=} e^{\epsilon} \Big( \frac{N}{n} \sum_{u \in \mathcal{U}} \sum_{i=1}^{p} v_i^T \mathrm{Diag}(q_u) v_i - (v_i^T q_u)^2 \Big) + e^{\epsilon} \frac{N}{n} \|W\|_F^2 = e^{\epsilon} \left( L_{avg}(V, Q) + \frac{N}{n} \|W\|_F^2 \right)$$

In step (a), we use the fact that $(v_i^T q_u)^2$ is non-negative. In step (b), we apply the fact that $q_{u^*} \le \exp(\epsilon) q_u$ for all $u$. In step (c), we express the bound in terms of $L_{avg}$, adding 0 in the form of $\|W\|_F^2 - \sum_u \sum_i (v_i^T q_u)^2$. This completes the proof.

Proof of Theorem 5.6.
Consider the following optimization problem, which is closely related to Problem 3.12:

$$\text{minimize}_{X \succ 0} \quad \mathrm{tr}[X^{-1} (W^T W)] \quad \text{subject to} \quad X_{uu} \le \frac{1}{n}$$

The optimal value of this problem is $(\lambda_1 + \cdots + \lambda_n)^2$ [29]. See also [46]. Furthermore, if $X^*$ is the optimal solution and the constraint is replaced with $X_{uu} \le c$, then $ncX^*$ remains optimal [30], in which case the bound becomes $\frac{1}{nc} (\lambda_1 + \cdots + \lambda_n)^2$.

We will now argue that any feasible solution to our problem can be directly transformed into a feasible solution of the above related problem. Suppose $Q$ is a feasible solution to Problem 3.12 and let $X = Q^T D_Q^{-1} Q$. Note that the objective functions are identical now. We will argue that $X_{uu} \le \frac{\exp(\epsilon)}{n}$:

$$X_{uu} = \sum_o \frac{Q_{ou}^2}{\sum_{u'} Q_{ou'}} \le \sum_o Q_{ou}^2 \frac{\exp(\epsilon)}{n Q_{ou}} = \frac{\exp(\epsilon)}{n} \sum_o Q_{ou} = \frac{\exp(\epsilon)}{n}$$

Thus, we have shown that any solution to Problem 3.12 gives rise to a corresponding solution to the above problem with $c = \frac{\exp(\epsilon)}{n}$. Thus, the SVD bound applies and we arrive at the desired result:

$$\frac{1}{\exp(\epsilon)} (\lambda_1 + \cdots + \lambda_n)^2 \le L(Q)$$

Proof of Corollary 5.7.

$$L_{worst}(V, Q) \ge L_{avg}(V, Q) = \frac{N}{n} \left[ L(V, Q) - \|W\|_F^2 \right] \ge \frac{N}{n} \left[ L(Q) - \|W\|_F^2 \right] \ge \frac{N}{n} \left[ \frac{1}{\exp(\epsilon)} (\lambda_1 + \cdots + \lambda_n)^2 - \|W\|_F^2 \right]$$

9. REFERENCES

[1] J. Acharya, Z. Sun, and H. Zhang. Communication efficient, sample optimal, linear time locally private discrete distribution estimation. arXiv preprint arXiv:1802.04705, 2018.
[2] J. Acharya, Z. Sun, and H. Zhang. Hadamard response: Estimating distributions privately, efficiently, and with little communication. arXiv preprint arXiv:1802.04705, 2018.
[3] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 273–282. ACM, 2007.
[4] R. Bassily.
Linear queries estimation with local differential privacy. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89, pages 721–729. PMLR, 2019.
[5] R. Bassily, K. Nissim, U. Stemmer, and A. G. Thakurta. Practical locally private heavy hitters. In Advances in Neural Information Processing Systems, pages 2288–2296, 2017.
[6] R. Bassily and A. Smith. Local, private, efficient protocols for succinct histograms. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 127–135, 2015.
[7] A. Ben-Israel and T. N. Greville. Generalized inverses: theory and applications, volume 15. Springer Science & Business Media, 2003.
[8] A. Bhaskara, D. Dadush, R. Krishnaswamy, and K. Talwar. Unconditional differentially private mechanisms for linear queries. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 1269–1284. ACM, 2012.
[9] J. Blasiok, M. Bun, A. Nikolov, and T. Steinke. Towards instance-optimal private query release. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2480–2497. Society for Industrial and Applied Mathematics, 2019.
[10] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[11] G. Casella and R. L. Berger. Statistical inference, volume 2. Duxbury Pacific Grove, CA, 2002.
[12] G. Cormode, T. Kulkarni, and D. Srivastava. Marginal release under local differential privacy. In Proceedings of the 2018 International Conference on Management of Data, pages 131–146. ACM, 2018.
[13] G. Cormode, T. Kulkarni, and D. Srivastava. Answering range queries under local differential privacy. Proceedings of the VLDB Endowment, 12(10):1126–1138, 2019.
[14] B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. In Advances in Neural Information Processing Systems, pages 3571–3580, 2017.
[15] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
[16] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[17] A. Edmonds, A. Nikolov, and J. Ullman. The power of factorization mechanisms in local and central differential privacy. arXiv preprint arXiv:1911.08339, 2019.
[18] Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067. ACM, 2014.
[19] M. Gaboardi, E. J. G. Arias, J. Hsu, A. Roth, and Z. S. Wu. Dual query: Practical private query release for high dimensional data. In International Conference on Machine Learning, pages 1170–1178, 2014.
[20] A. Griewank et al. On automatic differentiation. Mathematical Programming: recent developments and applications, 6(6):83–107, 1989.
[21] M. Hardt, K. Ligett, and F. McSherry. A simple and practical algorithm for differentially private data release. In Advances in Neural Information Processing Systems, pages 2339–2347, 2012.
[22] M. Hay, A. Machanavajjhala, G. Miklau, Y. Chen, and D. Zhang. Principled evaluation of differentially private algorithms using DPBench. In Proceedings of the 2016 International Conference on Management of Data, pages 139–154. ACM, 2016.
[23] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 3(1-2):1021–1032, 2010.
[24] N. Holohan, D. J. Leith, and O. Mason. Extreme points of the local differential privacy polytope. Linear Algebra and its Applications, 534:78–96, 2017.
[25] P. Kairouz, K. Bonawitz, and D. Ramage. Discrete distribution estimation under local privacy. arXiv preprint arXiv:1602.07387, 2016.
[26] P. Kairouz, S. Oh, and P. Viswanath. Extremal mechanisms for local differential privacy. In Advances in Neural Information Processing Systems, pages 2879–2887, 2014.
[27] C. Li, M. Hay, V. Rastogi, G. Miklau, and A. McGregor. Optimizing linear counting queries under differential privacy. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 123–134. ACM, 2010.
[28] C. Li and G. Miklau. An adaptive mechanism for accurate query answering under differential privacy. Proceedings of the VLDB Endowment, 5(6):514–525, 2012.
[29] C. Li and G. Miklau. Lower bounds on the error of query sets under the differentially-private matrix mechanism. Theory of Computing Systems, 57(4):1159–1201, 2015.
[30] C. Li, G. Miklau, M. Hay, A. McGregor, and V. Rastogi. The matrix mechanism: optimizing linear counting queries under differential privacy. The VLDB Journal, 24(6):757–781, 2015.
[31] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.
[32] D. Maclaurin, D. Duvenaud, and R. P. Adams. Autograd: Effortless gradients in NumPy. In ICML 2015 AutoML Workshop, volume 238, 2015.
[33] R. McKenna, G. Miklau, M. Hay, and A. Machanavajjhala. Optimizing error of high-dimensional statistical queries under differential privacy. Proceedings of the VLDB Endowment, 11(10):1206–1219, 2018.
[34] R. McKenna, D. Sheldon, and G. Miklau. Graphical-model based estimation and inference for differential privacy. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
[35] A. Nikolov, K. Talwar, and L. Zhang. The geometry of differential privacy: the sparse and approximate cases. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 351–360. ACM, 2013.
[36] J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
[37] G. Strang. Introduction to linear algebra, volume 3. Wellesley-Cambridge Press, Wellesley, MA, 1993.
[38] A. G. Thakurta, A. H. Vyrros, U. S. Vaishampayan, G. Kapoor, J. Freudiger, V. R. Sridhar, and D. Davidson. Learning new words, Mar. 14 2017. US Patent 9,594,741.
[39] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. Jarrod Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0–Fundamental Algorithms for Scientific Computing in Python. arXiv e-prints, page arXiv:1907.10121, Jul 2019.
[40] S. Wang, L. Huang, P. Wang, Y. Nie, H. Xu, W. Yang, X.-Y. Li, and C. Qiao. Mutual information optimally local private discrete distribution estimation. arXiv preprint arXiv:1607.08025, 2016.
[41] T. Wang, J. Blocki, N. Li, and S. Jha. Locally differentially private protocols for frequency estimation. In Proc. of the 26th USENIX Security Symposium, pages 729–745, 2017.
[42] T. Wang, B. Ding, J. Zhou, C. Hong, Z. Huang, N. Li, and S. Jha. Answering multi-dimensional analytical queries under local differential privacy. In Proceedings of the 2019 International Conference on Management of Data, pages 159–176, 2019.
[43] W. Wang and M. A. Carreira-Perpiñán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv preprint arXiv:1309.1541, 2013.
[44] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
[45] M. Ye and A. Barg. Optimal schemes for discrete distribution estimation under locally differential privacy. IEEE Transactions on Information Theory, 64(8):5662–5676, 2018.
[46] G. Yuan, Y. Yang, Z. Zhang, and Z. Hao. Convex optimization for linear query processing under approximate differential privacy. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2005–2014. ACM, 2016.
[47] G. Yuan, Z. Zhang, M. Winslett, X. Xiao, Y. Yang, and Z. Hao. Low-rank mechanism: optimizing batch queries under differential privacy. Proceedings of the VLDB Endowment, 5(11):1352–1363, 2012.
[48] D. Zhang, R. McKenna, I. Kotsogiannis, M. Hay, A. Machanavajjhala, and G. Miklau. Ektelo: A framework for defining differentially-private computations. In