Spectral Norm of Random Kernel Matrices with Applications to Privacy
Shiva Kasiviswanathan∗    Mark Rudelson†

Abstract
Kernel methods are an extremely popular set of techniques used for many important machine learning and data analysis applications. In addition to having good practical performance, these methods are supported by a well-developed theory. Kernel methods use an implicit mapping of the input data into a high-dimensional feature space defined by a kernel function, i.e., a function returning the inner product between the images of two data points in the feature space. Central to any kernel method is the kernel matrix, which is built by evaluating the kernel function on a given sample dataset.

In this paper, we initiate the study of non-asymptotic spectral theory of random kernel matrices. These are $n \times n$ random matrices whose $(i,j)$th entry is obtained by evaluating the kernel function on $x_i$ and $x_j$, where $x_1, \dots, x_n$ are a set of $n$ independent random high-dimensional vectors. Our main contribution is to obtain tight upper bounds on the spectral norm (largest eigenvalue) of random kernel matrices constructed by commonly used kernel functions based on polynomials and the Gaussian radial basis.

As an application of these results, we provide lower bounds on the distortion needed for releasing the coefficients of kernel ridge regression under attribute privacy, a general privacy notion which captures a large class of privacy definitions. Kernel ridge regression is a standard method for performing non-parametric regression that regularly outperforms traditional regression approaches in various domains. Our privacy distortion lower bounds are the first for any kernel technique, and our analysis assumes realistic scenarios for the input, unlike all previous lower bounds for other release problems which only hold under very restrictive input settings.

1 Introduction

In recent years there has been significant progress in the development and application of kernel methods for many practical machine learning and data analysis problems. Kernel methods are regularly used for a range of problems such as classification (binary/multiclass), regression, ranking, and unsupervised learning, where they are known to almost always outperform "traditional" statistical techniques [23, 24]. At the heart of kernel methods is the notion of a kernel function, which is a real-valued function of two variables. The power of kernel methods stems from the fact that for every (positive definite) kernel function it is possible to define an inner product and a lifting (which could be nonlinear) such that the inner product between any two lifted datapoints can be quickly computed using the kernel function evaluated at those two datapoints. This allows for the introduction of nonlinearity into traditional optimization problems (such as ridge regression, support vector machines, principal component analysis) without unduly complicating them.

The main ingredient of any kernel method is the kernel matrix, which is built using the kernel function, evaluated at given sample points. Formally, given a kernel function $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and a sample set $x_1, \dots, x_n$, the kernel matrix $K$ is an $n \times n$ matrix with its $(i,j)$th entry $K_{ij} = \kappa(x_i, x_j)$. Common choices of kernel functions include the polynomial kernel ($\kappa(x_i, x_j) = (a\langle x_i, x_j \rangle + b)^p$, for $p \in \mathbb{N}$) and the Gaussian kernel ($\kappa(x_i, x_j) = \exp(-a\|x_i - x_j\|^2)$, for $a > 0$) [23, 24].

In this paper, we initiate the study of non-asymptotic spectral properties of random kernel matrices.
A random kernel matrix, for a kernel function $\kappa$, is the kernel matrix $K$ formed by $n$ independent random vectors $x_1, \dots, x_n \in \mathbb{R}^d$.

∗Samsung Research America, [email protected]. Part of the work was done while the author was at General Electric Research.
†University of Michigan, [email protected].

The prior work on random kernel matrices [13, 2, 6] has established various interesting properties of the spectral distributions of these matrices in the asymptotic sense (as $n, d \to \infty$). However, analyzing algorithms based on kernel methods typically requires understanding the spectral properties of these random kernel matrices for large, but fixed, $n$ and $d$. A similar parallel also holds in the study of the spectral properties of "traditional" random matrices, where recent developments in the non-asymptotic theory of random matrices have complemented the classical random matrix theory that was mostly focused on asymptotic spectral properties [27, 20].

We investigate upper bounds on the largest eigenvalue (spectral norm) of random kernel matrices for polynomial and Gaussian kernels. We show that for inputs $x_1, \dots, x_n$ drawn independently from a wide class of probability distributions over $\mathbb{R}^d$ (satisfying the subgaussian property), the spectral norm of a random kernel matrix constructed using a polynomial kernel of degree $p$ is, with high probability, roughly bounded by $O(d^p n)$. In a similar setting, we show that the spectral norm of a random kernel matrix constructed using a Gaussian kernel is bounded by $O(n)$, and with high probability this bound reduces to $O(1)$ under some stronger assumptions on the subgaussian distributions. These bounds are almost tight. Since the entries of a random kernel matrix are highly correlated, the existing techniques prevalent in random matrix theory cannot be directly applied. We overcome this problem by careful splitting and conditioning arguments on the random kernel matrix. Combining these with subgaussian norm concentrations forms the basis of our proofs.

Applications.
The largest eigenvalue of kernel matrices plays an important role in the analysis of many machine learning algorithms. Some examples include bounding the Rademacher complexity for multiple kernel learning [16], analyzing the convergence rate of the conjugate gradient technique for matrix-valued kernel learning [26], and establishing concentration bounds for eigenvalues of kernel matrices [12, 25].

In this paper, we focus on an application of these eigenvalue bounds to an important problem arising while analyzing sensitive data. Consider a curator who manages a database of sensitive information but wants to release statistics about how a sensitive attribute (say, disease) in the database relates with some nonsensitive attributes (e.g., postal code, age, gender, etc.). This setting is widely considered in the applied data privacy literature, partly since it arises with medical and retail data. Ridge regression is a well-known approach for solving these problems due to its good generalization performance. Kernel ridge regression is a powerful technique for building nonlinear regression models that operates by combining ridge regression with kernel methods [21].¹ We present a linear reconstruction attack² that reconstructs, with high probability, almost all the sensitive attribute entries given a sufficiently accurate approximation of the kernel ridge regression coefficients. We consider reconstruction attacks against attribute privacy, a loose notion of privacy, where the goal is to just avoid any gross violation of privacy. Concretely, the input is assumed to be a database whose $i$th row (record for individual $i$) is $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$ is assumed to be known to the attacker (public information) and $y_i \in \{0,1\}$ is the sensitive attribute, and a privacy mechanism is attribute non-private if the attacker can consistently reconstruct a large fraction of the sensitive attribute $(y_1, \dots, y_n)$. We show that any privacy mechanism that always adds $o(1/(d^p n))$ noise³ to each coefficient of a polynomial kernel ridge regression model is attribute non-private. Similarly, any privacy mechanism that always adds $o(1)$ noise to each coefficient of a Gaussian kernel ridge regression model is attribute non-private. As we later discuss, there exist natural settings of inputs under which these kernel ridge regression coefficients, even without the privacy constraint, have the same magnitude as these noise bounds, implying that privacy comes at a steep price. While the linear reconstruction attacks employed in this paper are themselves well-known [9, 15, 14], these are the first attribute privacy lower bounds that: (i) are applicable to any kernel method and (ii) work for any $d$-dimensional data; analyses of all previous attacks (for other release problems) require $d$ to be comparable to $n$. Additionally, unlike previous reconstruction attack analyses, our bounds hold for a wide class of realistic distributional assumptions on the data.

¹We provide a brief coverage of the basics of kernel ridge regression in Section 4.
²In a linear reconstruction attack, given the released information $\rho$, the attacker constructs a system of approximate linear equalities of the form $Az \approx \rho$ for a matrix $A$ and attempts to solve for $z$.
³Ignoring the dependence on other parameters, including the regularization parameter of ridge regression.

Comparison to Related Work. In this paper, we study the largest eigenvalue of an $n \times n$ random kernel matrix in the non-asymptotic sense.
The general goal in studying the non-asymptotic theory of random matrices is to understand the spectral properties of random matrices, valid with high probability for matrices of a large fixed size. This is in contrast with the existing theory on random kernel matrices, which has focused on the asymptotics of various spectral characteristics of these random matrices when the dimensions of the matrices tend to infinity. Let $x_1, \dots, x_n \in \mathbb{R}^d$ be $n$ i.i.d. random vectors. For any $F : \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$, symmetric in the first two variables, consider the random kernel matrix $K$ with $(i,j)$th entry $K_{ij} = F(x_i, x_j, d)$. El Karoui [13] considered the case where $K$ is generated by either inner-product kernels (i.e., $F(x_i, x_j, d) = f(\langle x_i, x_j \rangle, d)$) or distance kernels (i.e., $F(x_i, x_j, d) = f(\|x_i - x_j\|, d)$). It was shown there that under some assumptions on $f$ and on the distributions of the $x_i$'s, and in the "large $d$, large $n$" limit (i.e., $d, n \to \infty$ and $d/n \to c \in (0, \infty)$): a) the non-linear kernel matrix converges asymptotically in spectral norm to a linear kernel matrix, and b) there is a weak convergence of the limiting spectral density. These results were recently strengthened in different directions by Cheng et al. [2] and Do et al. [6]. To the best of our knowledge, ours is the first paper investigating the non-asymptotic spectral properties of a random kernel matrix.

Just as the development of the non-asymptotic theory of traditional random matrices has found a multitude of applications in areas including statistics, geometric functional analysis, and compressed sensing [27], we believe that the growth of a non-asymptotic theory of random kernel matrices will help in better understanding many machine learning applications that utilize kernel techniques.

The goal of private data analysis is to release global, statistical properties of a database while protecting the privacy of the individuals whose information the database contains. Differential privacy [7] is a formal notion of privacy tailored to private data analysis. Differential privacy requires, roughly, that any single individual's data have little effect on the outcome of the analysis. A lot of recent research has gone into developing differentially private algorithms for various applications, including kernel methods [11]. A typical objective here is to release as accurate an approximation as possible to some function $f$ evaluated on a database $D$.

In this paper, we follow a complementary line of work that seeks to understand how much distortion (noise) is necessary to privately release some particular function $f$ evaluated on a database containing sensitive information [5, 8, 9, 15, 4, 18, 3, 19, 14]. The general idea here is to provide reconstruction attacks, which are attacks that can reconstruct (almost all of) the sensitive part of the database $D$ given sufficiently accurate approximations to $f(D)$. Reconstruction attacks violate any reasonable notion of privacy (including differential privacy), and the existence of these attacks directly translates into lower bounds on the distortion needed for privacy.

Linear reconstruction attacks were first considered in the context of data privacy by Dinur and Nissim [5], who showed that any mechanism which answers $\approx n \log n$ random inner product queries on a database in $\{0,1\}^n$ with $o(\sqrt{n})$ noise per query is not private.
Their attack was subsequently extended in various directions by [8, 9, 18, 3]. The results that are closest to our work are the attribute privacy lower bounds analyzed for releasing $k$-way marginals [15, 4], linear/logistic regression parameters [14], and a subclass of statistical $M$-estimators [14]. Kasiviswanathan et al. [15] showed that, if $d = \tilde{\Omega}(n^{1/(k-1)})$,⁴ then any mechanism which releases all $k$-way marginal tables with $o(\sqrt{n})$ noise per entry is attribute non-private. These noise bounds were improved by De [4], who presented an attack that can tolerate a constant fraction of entries with arbitrarily high noise, as long as the remaining entries have $o(\sqrt{n})$ noise. Kasiviswanathan et al. [14] recently showed that, if $d = \Omega(n)$, then any mechanism which releases $d$ different linear or logistic regression estimators each with $o(1/\sqrt{n})$ noise is attribute non-private. They also showed that this lower bound extends to a subclass of statistical $M$-estimators. In all these previous results, $d$ has to be comparable to $n$, and this dependency looks unavoidable in those results due to their use of least singular value bounds. However, in this paper, our privacy lower bounds hold for all values of $d, n$ ($d$ could be $\ll n$). Additionally, all the previous reconstruction attack analyses critically require the $x_i$'s to be drawn from a product of univariate subgaussian distributions, whereas our analysis here holds for any $d$-dimensional subgaussian distribution (not necessarily a product distribution), and is thereby more widely applicable. The subgaussian assumption on the input data is quite common in the analysis of machine learning algorithms [1].

⁴The $\tilde{\Omega}$ notation hides polylogarithmic factors.

Notation.
We use $[n]$ to denote the set $\{1, \dots, n\}$. $d_H(\cdot, \cdot)$ measures the Hamming distance. Vectors used in the paper are by default column vectors and are denoted by boldface letters. For a vector $v$, $v^\top$ denotes its transpose and $\|v\|$ denotes its Euclidean norm. For two vectors $v_1$ and $v_2$, $\langle v_1, v_2 \rangle$ denotes their inner product. For a matrix $M$, $\|M\|$ denotes its spectral norm, $\|M\|_F$ denotes its Frobenius norm, and $M_{ij}$ denotes its $(i,j)$th entry. $I_n$ represents the identity matrix of dimension $n$. The unit sphere in $d$ dimensions centered at the origin is denoted by $S^{d-1} = \{z : \|z\| = 1, z \in \mathbb{R}^d\}$. Throughout this paper $C, c, C'$, also with subscripts, denote absolute constants (i.e., independent of $d$ and $n$) whose value may change from line to line.

2 Preliminaries

We provide a very brief introduction to the theory of kernel methods; see the many books on the topic [23, 24] for further details.
Definition 1 (Kernel Function). Let $\mathcal{X}$ be a non-empty set. Then a function $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a kernel function on $\mathcal{X}$ if there exists a Hilbert space $\mathcal{H}$ over $\mathbb{R}$ and a map $\phi : \mathcal{X} \to \mathcal{H}$ such that for all $x, y \in \mathcal{X}$ we have $\kappa(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$.

For any symmetric and positive semidefinite kernel $\kappa$,⁵ by Mercer's theorem [17] there exist: (i) a unique functional Hilbert space $\mathcal{H}$ (referred to as the reproducing kernel Hilbert space, Definition 2) on $\mathcal{X}$ such that $\kappa(\cdot, \cdot)$ is the inner product in the space, and (ii) a map $\phi$ defined as $\phi(x) := \kappa(\cdot, x)$⁶ that satisfies Definition 1. The function $\phi$ is called the feature map and the space $\mathcal{H}$ is called the feature space.

⁵A positive definite kernel is a function $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that for any $n \geq 1$, any finite set of points $\{x_i\}_{i=1}^n$ in $\mathcal{X}$, and real numbers $\{a_i\}_{i=1}^n$, we have $\sum_{i,j=1}^n a_i a_j \kappa(x_i, x_j) \geq 0$.
⁶$\kappa(\cdot, x)$ is a vector with entries $\kappa(x', x)$ for all $x' \in \mathcal{X}$.

Definition 2 (Reproducing Kernel Hilbert Space). A kernel $\kappa(\cdot, \cdot)$ is a reproducing kernel of a Hilbert space $\mathcal{H}$ if $\forall f \in \mathcal{H}$, $f(x) = \langle \kappa(\cdot, x), f(\cdot) \rangle_{\mathcal{H}}$. For a (compact) $\mathcal{X} \subseteq \mathbb{R}^d$ and a Hilbert space $\mathcal{H}$ of functions $f : \mathcal{X} \to \mathbb{R}$, we say $\mathcal{H}$ is a reproducing kernel Hilbert space if there exists a $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that: a) $\kappa$ has the reproducing property, and b) $\kappa$ spans $\mathcal{H} = \mathrm{span}\{\kappa(\cdot, x) : x \in \mathcal{X}\}$.

A standard idea used in the machine-learning community (commonly referred to as the "kernel trick") is that kernels allow for the computation of inner products in high-dimensional feature spaces ($\langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$) using simple functions defined on pairs of input patterns ($\kappa(x, y)$), without knowing the $\phi$ mapping explicitly. This trick allows one to efficiently solve a variety of non-linear optimization problems. Note that there is no restriction on the dimension of the feature maps ($\phi(x)$); i.e., they could be of infinite dimension.

Polynomial and Gaussian kernels are two popular kernel functions that are used in many machine learning and data mining tasks such as classification, regression, ranking, and structured prediction. Let the input space be $\mathcal{X} = \mathbb{R}^d$. For $x, y \in \mathbb{R}^d$, these kernels are defined as:
(1) Polynomial Kernel: $\kappa(x, y) = (a\langle x, y \rangle + b)^p$, with parameters $a, b \in \mathbb{R}$ and $p \in \mathbb{N}$. Here $a$ is referred to as the slope parameter, $b \geq 0$ trades off the influence of higher-order versus lower-order terms in the polynomial, and $p$ is the polynomial degree. For an input $x \in \mathbb{R}^d$, the feature map $\phi(x)$ of the polynomial kernel is a vector with a number of dimensions polynomial in $d$ [23].

(2) Gaussian Kernel (also frequently referred to as the radial basis kernel): $\kappa(x, y) = \exp(-a\|x - y\|^2)$, with real parameter $a > 0$. The value of $a$ controls the locality of the kernel, with low values indicating that the influence of a single point reaches "far" and vice versa [23]. An equivalent popular formulation is to set $a = 1/2\sigma^2$, and hence $\kappa(x, y) = \exp(-\|x - y\|^2/2\sigma^2)$. For an input $x \in \mathbb{R}^d$, the feature map $\phi(x)$ of the Gaussian kernel is a vector of infinite dimension [23]. Note that while we focus on the Gaussian kernel in this paper, the extension of our results to other exponential kernels such as the Laplacian kernel (where $\kappa(x, y) = \exp(-a\|x - y\|)$) is quite straightforward.
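To make these definitions concrete, here is a minimal NumPy sketch (ours, not from the paper; all parameter choices are illustrative) that builds the $n \times n$ kernel matrix $K$ with $K_{ij} = \kappa(x_i, x_j)$ for both kernels:

```python
import numpy as np

def polynomial_kernel_matrix(X, a=1.0, b=1.0, p=4):
    """K_ij = (a * <x_i, x_j> + b)^p for the rows x_i of X (shape n x d)."""
    return (a * (X @ X.T) + b) ** p

def gaussian_kernel_matrix(X, a=1.0):
    """K_ij = exp(-a * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * (X @ X.T)
    return np.exp(-a * np.maximum(sq_dists, 0.0))  # clip tiny negatives from rounding

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 100))     # n = 500 standard Gaussian vectors in d = 100
K_poly = polynomial_kernel_matrix(X)
K_gauss = gaussian_kernel_matrix(X, a=3 * np.log(500) / 100)
print(np.linalg.norm(K_poly, 2), np.linalg.norm(K_gauss, 2))  # spectral norms
```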
Let us start by formally defining subgaussian random variables and vectors.

Definition 3 (Subgaussian Random Variable and Vector). We call a random variable $x \in \mathbb{R}$ subgaussian if there exists a constant $C > 0$ such that $\Pr[|x| > t] \leq 2\exp(-t^2/C^2)$ for all $t > 0$. We say that a random vector $x \in \mathbb{R}^d$ is subgaussian if the one-dimensional marginals $\langle x, y \rangle$ are subgaussian random variables for all $y \in \mathbb{R}^d$.

The class of subgaussian random variables includes many random variables that arise naturally in data analysis, such as standard normal, Bernoulli, spherical, and bounded random variables (where the random variable $x$ satisfies $|x| \leq M$ almost surely for some fixed $M$). The natural generalizations of these random variables to higher dimensions are all subgaussian random vectors. For many isotropic convex sets $K$ (such as the hypercube), a random vector $x$ uniformly distributed in $K$ is subgaussian.⁷

Definition 4 (Norm of Subgaussian Random Variable and Vector). The $\psi_2$-norm of a subgaussian random variable $x \in \mathbb{R}$, denoted by $\|x\|_{\psi_2}$, is
$$\|x\|_{\psi_2} = \inf\big\{t > 0 : \mathbb{E}[\exp(x^2/t^2)] \leq 2\big\}.$$
The $\psi_2$-norm of a subgaussian random vector $x \in \mathbb{R}^d$ is
$$\|x\|_{\psi_2} = \sup_{y \in S^{d-1}}\|\langle x, y \rangle\|_{\psi_2}.$$

Claim 1 (Vershynin [27]). Let $x$ be a subgaussian random variable. Then there exists a constant $C > 0$ such that $\Pr[|x| > t] \leq 2\exp(-Ct^2/\|x\|_{\psi_2}^2)$ for all $t > 0$.

Consider a subset $T$ of $\mathbb{R}^d$, and let $\epsilon > 0$. An $\epsilon$-net of $T$ is a subset $\mathcal{N} \subseteq T$ such that for every $x \in T$ there exists a $z \in \mathcal{N}$ with $\|x - z\| \leq \epsilon$. We will use the following well-known result about the size of $\epsilon$-nets.

Proposition 2.1 (Bounding the size of an $\epsilon$-net [27]). Let $T$ be a subset of $S^{d-1}$ and let $\epsilon > 0$. Then there exists an $\epsilon$-net of $T$ of cardinality at most $(1 + 2/\epsilon)^d$.

The proof of the following claim follows by standard techniques.
Claim 2 ([27]). Let $\mathcal{N}$ be a $(1/2)$-net of $S^{d-1}$. Then for any $x \in \mathbb{R}^d$, $\|x\| \leq 2\max_{y \in \mathcal{N}}\langle x, y \rangle$.

⁷A convex set $K$ in $\mathbb{R}^d$ is called isotropic if a random vector chosen uniformly from $K$ according to the volume is isotropic. A random vector $x \in \mathbb{R}^d$ is isotropic if for all $y \in \mathbb{R}^d$, $\mathbb{E}[\langle x, y \rangle^2] = \|y\|^2$.

3 Largest Eigenvalue of Random Kernel Matrices
In this section, we provide the upper bound on the largest eigenvalue of a random kernel matrix constructed using polynomial or Gaussian kernels. Notice that the entries of a random kernel matrix are dependent; for example, any triplet of entries $(i,j)$, $(j,k)$, and $(k,i)$ is mutually dependent. Additionally, we deal with vectors drawn from general subgaussian distributions, and therefore the coordinates within a random vector need not be independent.

We start off with a simple lemma to bound the Euclidean norm of a subgaussian random vector. A random vector $x$ is centered if $\mathbb{E}[x] = 0$.

Lemma 3.1.
Let $x_1, \dots, x_n \in \mathbb{R}^d$ be independent centered subgaussian vectors. Then for all $i \in [n]$,
$$\Pr\big[\|x_i\| \geq C\sqrt{d}\big] \leq \exp(-C'd)$$
for constants $C, C'$.

Proof.
To this end, note that since $x_i$ is a subgaussian vector (from Definition 3),
$$\Pr\big[|\langle x_i, y \rangle| \geq C\sqrt{d}/2\big] \leq \exp(-C_2 d)$$
for constants $C, C_2$ and any unit vector $y \in S^{d-1}$. Taking the union bound over a $(1/2)$-net $\mathcal{N}$ of $S^{d-1}$, and using Proposition 2.1 to bound the size of the net (which is at most $5^d$, as $\epsilon = 1/2$), we get that
$$\Pr\Big[\max_{y \in \mathcal{N}}|\langle x_i, y \rangle| \geq C\sqrt{d}/2\Big] \leq \exp(-C_3 d).$$
From Claim 2, we know that $\|x_i\| \leq 2\max_{y \in \mathcal{N}}\langle x_i, y \rangle$. Hence,
$$\Pr\big[\|x_i\| \geq C\sqrt{d}\big] \leq \exp(-C'd).$$
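As a quick numerical sanity check of Lemma 3.1 (our illustration, assuming standard Gaussian data, for which $\|x_i\|$ concentrates near $\sqrt{d}$), the event $\|x_i\| \geq C\sqrt{d}$ with, say, $C = 2$ is essentially never observed:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, C = 10_000, 100, 2.0
X = rng.standard_normal((n, d))          # centered subgaussian (standard normal) vectors
norms = np.linalg.norm(X, axis=1)        # ||x_i||, concentrates around sqrt(d) = 10
frac = np.mean(norms >= C * np.sqrt(d))  # empirical Pr[||x_i|| >= C sqrt(d)]
print(f"median norm = {np.median(norms):.2f}, fraction above C*sqrt(d) = {frac:.4f}")
```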
Polynomial Kernel. We now establish the bound on the spectral norm of a polynomial kernel random matrix. We assume $x_1, \dots, x_n$ are independent vectors drawn according to a centered subgaussian distribution over $\mathbb{R}^d$. Let $K_p$ denote the kernel matrix obtained using $x_1, \dots, x_n$ in a polynomial kernel. Our idea is to split the kernel matrix $K_p$ into its diagonal and off-diagonal parts, and then bound the spectral norms of these two matrices separately. The diagonal part contains independent entries of the form $(a\|x_i\|^2 + b)^p$, and we use Lemma 3.1 to bound its spectral norm. Dealing with the off-diagonal part of $K_p$ is trickier because of the dependence between the entries, and here we bound the spectral norm by the Frobenius norm. We also verify the upper bounds provided in the following theorem by conducting numerical experiments (see Figure 1(a)).

Theorem 3.2.
Let $x_1, \dots, x_n \in \mathbb{R}^d$ be independent centered subgaussian vectors. Let $p \in \mathbb{N}$, and let $K_p$ be the $n \times n$ matrix with $(i,j)$th entry $(K_p)_{ij} = (a\langle x_i, x_j \rangle + b)^p$. Assume that $n \leq \exp(C_1 d)$ for a constant $C_1$. Then there exist constants $C_2, C'$ such that
$$\Pr\big[\|K_p\| \geq C_2^p|a|^p d^p n + 2^{p+1}|b|^p n\big] \leq \exp(-C'd).$$

Proof.
To prove the theorem, we split the kernel matrix $K_p$ into its diagonal and off-diagonal parts. Let $K_p = D + W$, where $D$ represents the diagonal part of $K_p$ and $W$ the off-diagonal part of $K_p$. Note that
$$\|K_p\| \leq \|D\| + \|W\| \leq \|D\| + \|W\|_F.$$
Let us estimate the norm of the diagonal part $D$ first. From Lemma 3.1, we know that for all $i \in [n]$ (with $C_3 = C^2$),
$$\Pr\big[\|x_i\|^2 \geq C_3 d\big] = \Pr\big[\|x_i\| \geq C\sqrt{d}\big] \leq \exp(-C_4 d).$$
Since $D_{ii} = (a\|x_i\|^2 + b)^p$, we are interested in bounding $(a\|x_i\|^2 + b)^p$. Note that
$$\Pr\big[\|x_i\| \geq C\sqrt{d}\big] = \Pr\big[(a\|x_i\|^2 + b)^p \geq (a(C\sqrt{d})^2 + b)^p\big]. \quad (1)$$
Consider $(a(C\sqrt{d})^2 + b)^p$. A simple inequality⁸ to bound it is $(a(C\sqrt{d})^2 + b)^p \leq 2^p(|a|^p C^{2p} d^p + |b|^p)$. Therefore,
$$\Pr\big[(a\|x_i\|^2 + b)^p \geq 2^p(|a|^p C^{2p} d^p + |b|^p)\big] \leq \Pr\big[(a\|x_i\|^2 + b)^p \geq (a(C\sqrt{d})^2 + b)^p\big].$$
Using (1) and substituting in the above equation, for any $i \in [n]$,
$$\Pr\big[(a\|x_i\|^2 + b)^p \geq 2^p(|a|^p C^{2p} d^p + |b|^p)\big] \leq \Pr\big[\|x_i\| \geq C\sqrt{d}\big] \leq \exp(-C_4 d).$$
By applying a union bound over all $n$ non-zero entries in $D$, we get that
$$\Pr\big[\exists\, i \in [n] : (a\|x_i\|^2 + b)^p \geq 2^p(|a|^p C^{2p} d^p + |b|^p)\big] \leq n\exp(-C_4 d) \leq \exp(C_1 d)\exp(-C_4 d) \leq \exp(-C_5 d),$$
as we assumed that $n \leq \exp(C_1 d)$. This implies that
$$\Pr\big[\|D\| \geq 2^p(|a|^p C^{2p} d^p + |b|^p)\big] \leq \exp(-C_5 d). \quad (2)$$
We now bound the spectral norm of the off-diagonal part $W$ using the Frobenius norm as an upper bound on the spectral norm. First note that for any $y \in \mathbb{R}^d$, the random variable $\langle x_i, y \rangle$ is subgaussian with its $\psi_2$-norm at most $C_6\|y\|$ for some constant $C_6$. This follows from Definition 4, as
$$\|\langle x_i, y \rangle\|_{\psi_2} := \inf\big\{t > 0 : \mathbb{E}[\exp(\langle x_i, y \rangle^2/t^2)] \leq 2\big\} \leq C_6\|y\|.$$
Therefore, for a fixed $x_j$, $\|\langle x_i, x_j \rangle\|_{\psi_2} \leq C_6\|x_j\|$. For $i \neq j$, conditioning on $x_j$,
$$\Pr[|\langle x_i, x_j \rangle| \geq \tau] = \mathbb{E}_{x_j}\big[\Pr[|\langle x_i, x_j \rangle| \geq \tau \mid x_j]\big].$$
From Claim 1,
$$\mathbb{E}_{x_j}\big[\Pr[|\langle x_i, x_j \rangle| \geq \tau \mid x_j]\big] \leq \mathbb{E}_{x_j}\bigg[\exp\bigg(-\frac{C_7\tau^2}{\|\langle x_i, x_j \rangle\|_{\psi_2}^2}\bigg)\bigg] \leq \mathbb{E}_{x_j}\bigg[\exp\bigg(-\frac{C_7\tau^2}{(C_6\|x_j\|)^2}\bigg)\bigg] = \mathbb{E}_{x_j}\bigg[\exp\bigg(-\frac{C_8\tau^2}{\|x_j\|^2}\bigg)\bigg],$$
where the last inequality uses the fact that $\|\langle x_i, x_j \rangle\|_{\psi_2} \leq C_6\|x_j\|$. Now let us condition the above expectation on the value of $\|x_j\|$, based on whether $\|x_j\| \geq C\sqrt{d}$ or $\|x_j\| < C\sqrt{d}$. We can rewrite
$$\mathbb{E}_{x_j}\bigg[\exp\bigg(-\frac{C_8\tau^2}{\|x_j\|^2}\bigg)\bigg] \leq \mathbb{E}_{x_j}\bigg[\exp\bigg(-\frac{C_8\tau^2}{C^2 d}\bigg)\,\bigg|\,\|x_j\| < C\sqrt{d}\bigg]\Pr\big[\|x_j\| < C\sqrt{d}\big] + \mathbb{E}_{x_j}\bigg[\exp\bigg(-\frac{C_8\tau^2}{\|x_j\|^2}\bigg)\,\bigg|\,\|x_j\| \geq C\sqrt{d}\bigg]\Pr\big[\|x_j\| \geq C\sqrt{d}\big].$$
⁸For any $a, b, m \in \mathbb{R}$ and $p \in \mathbb{N}$, $(a \cdot m + b)^p \leq 2^p(|a|^p|m|^p + |b|^p)$.

Hence,
$$\mathbb{E}_{x_j}\bigg[\exp\bigg(-\frac{C_8\tau^2}{\|x_j\|^2}\bigg)\bigg] \leq \exp\bigg(-\frac{C_9\tau^2}{d}\bigg) + \mathbb{E}_{x_j}\bigg[\exp\bigg(-\frac{C_8\tau^2}{\|x_j\|^2}\bigg)\,\bigg|\,\|x_j\| \geq C\sqrt{d}\bigg]\Pr\big[\|x_j\| \geq C\sqrt{d}\big].$$
From Lemma 3.1,
$$\Pr\big[\|x_j\| \geq C\sqrt{d}\big] \leq \exp(-C_4 d), \quad \text{and} \quad \mathbb{E}_{x_j}\bigg[\exp\bigg(-\frac{C_8\tau^2}{\|x_j\|^2}\bigg)\,\bigg|\,\|x_j\| \geq C\sqrt{d}\bigg] \leq 1.$$
This implies that (since
$\Pr[\|x_j\| \geq C\sqrt{d}] \leq \exp(-C_4 d)$),
$$\mathbb{E}_{x_j}\bigg[\exp\bigg(-\frac{C_8\tau^2}{\|x_j\|^2}\bigg)\,\bigg|\,\|x_j\| \geq C\sqrt{d}\bigg]\Pr\big[\|x_j\| \geq C\sqrt{d}\big] \leq \exp(-C_4 d).$$
Putting the above arguments together,
$$\Pr[|\langle x_i, x_j \rangle| \geq \tau] = \mathbb{E}_{x_j}\big[\Pr[|\langle x_i, x_j \rangle| \geq \tau \mid x_j]\big] \leq \exp\bigg(-\frac{C_9\tau^2}{d}\bigg) + \exp(-C_4 d).$$
Taking a union bound over all $(n^2 - n) < n^2$ non-zero entries in $W$,
$$\Pr\Big[\max_{i \neq j}|\langle x_i, x_j \rangle| \geq \tau\Big] \leq n^2\bigg(\exp\bigg(-\frac{C_9\tau^2}{d}\bigg) + \exp(-C_4 d)\bigg).$$
Setting $\tau = C_{10} \cdot d$ in the above and using the fact that $n \leq \exp(C_1 d)$,
$$\Pr\Big[\max_{i \neq j}|\langle x_i, x_j \rangle| \geq C_{10} \cdot d\Big] \leq \exp(-C_{11} d). \quad (3)$$
We are now ready to bound the Frobenius norm of $W$:
$$\|W\|_F = \Big(\sum_{i \neq j}(a\langle x_i, x_j \rangle + b)^{2p}\Big)^{1/2} \leq \Big(n^2\, 2^{2p}\big(|a|^{2p}\max_{i \neq j}|\langle x_i, x_j \rangle|^{2p} + |b|^{2p}\big)\Big)^{1/2} \leq n\, 2^p\big(|a|^p\max_{i \neq j}|\langle x_i, x_j \rangle|^p + |b|^p\big).$$
Plugging in the probabilistic bound on $\max_{i \neq j}|\langle x_i, x_j \rangle|$ from (3) gives
$$\Pr\big[\|W\|_F \geq n\, 2^p(|a|^p C_{10}^p d^p + |b|^p)\big] \leq \exp(-C_{11} d). \quad (4)$$
Plugging the bounds on $\|D\|$ (from (2)) and on $\|W\|_F$ (from (4)) into $\|K_p\| \leq \|D\| + \|W\|_F$ yields that there exist constants $C_{12}$ and $C'$ such that
$$\Pr\big[\|K_p\| \geq C_{12}^p|a|^p d^p n + 2^{p+1}|b|^p n\big] \leq \Pr\big[\|D\| + \|W\|_F \geq C_{12}^p|a|^p d^p n + 2^{p+1}|b|^p n\big] \leq \exp(-C'd).$$
This completes the proof of the theorem. The chain of constants can easily be estimated starting with the constant in the definition of the subgaussian random variable.
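The two ingredients of this proof, the split $\|K_p\| \leq \|D\| + \|W\|_F$ and the resulting $O(d^p n)$ growth of the spectral norm, are easy to probe numerically. The sketch below is our illustration, not the authors' code; the parameters $a = b = 1$, $p = 4$, $d = 100$ mirror the choices reported for Figure 1:

```python
import numpy as np

def poly_kernel(X, a=1.0, b=1.0, p=4):
    return (a * (X @ X.T) + b) ** p

rng = np.random.default_rng(0)
d, a, b, p = 100, 1.0, 1.0, 4
for n in [100, 200, 400, 800]:
    X = rng.standard_normal((n, d))          # centered subgaussian (standard normal) rows
    Kp = poly_kernel(X, a, b, p)
    D = np.diag(np.diag(Kp))                 # diagonal part
    W = Kp - D                               # off-diagonal part
    spec = np.linalg.norm(Kp, 2)             # spectral norm ||K_p||
    split = np.linalg.norm(D, 2) + np.linalg.norm(W, "fro")
    print(f"n={n:4d}  ||K_p||={spec:.3e}  ||D||+||W||_F={split:.3e}  "
          f"||K_p||/(d^p n)={spec / (d**p * n):.3f}")
```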
Remark:
Note that for our proofs it is only necessary that $x_1, \dots, x_n$ are independent random vectors; they need not be identically distributed. This spectral norm upper bound on $K_p$ (again holding with exponentially high probability) could be improved to
$$O\big(C^p|a|^p(d^p + d^{p/2}n) + 2^{p+1}n|b|^p\big)$$
with a slightly more involved analysis (omitted in this extended abstract). For an even $p$, the expectation of every individual entry of the matrix $K_p$ is positive, which provides tight examples for this bound.

Gaussian Kernel. We now establish the bound on the spectral norm of a Gaussian kernel random matrix. Again assume $x_1, \dots, x_n$ are independent vectors drawn according to a centered subgaussian distribution over $\mathbb{R}^d$, and let $K_g$ denote the kernel matrix obtained using $x_1, \dots, x_n$ in a Gaussian kernel. Here an upper bound of $n$ on the spectral norm of the kernel matrix follows trivially, as all entries of $K_g$ are less than or equal to $1$. We show that this bound is tight, in the sense that for small values of $a$, with high probability the spectral norm is at least $\Omega(n)$.

In fact, it is impossible to obtain a better than $O(n)$ upper bound on the spectral norm of $K_g$ without additional assumptions on the subgaussian distribution, as illustrated by the following example. Consider a distribution over $\mathbb{R}^d$ such that a random vector drawn from this distribution is the zero vector $0_d$ with probability $1/2$ and uniformly distributed over the sphere in $\mathbb{R}^d$ of radius $\sqrt{2d}$ with probability $1/2$. A random vector $x$ drawn from this distribution is isotropic and subgaussian, but $\Pr[x = 0_d] = 1/2$. Therefore, among $x_1, \dots, x_n$ drawn from this distribution, with high probability more than a constant fraction of the vectors will be $0_d$. This means that a proportional number of entries of the matrix $K_g$ will be $1$, and the norm will be $\Omega(n)$ regardless of $a$.

This situation changes, however, when we add the additional assumption that $x_1, \dots, x_n$ have independent centered subgaussian coordinates (i.e., each $x_i$ is drawn from a product distribution formed from some $d$ centered univariate subgaussian distributions). In that case, the kernel matrix $K_g$ is a small perturbation of the identity matrix, and we show that the spectral norm of $K_g$ is with high probability bounded by an absolute constant (for $a = \Omega(\log n/d)$). For this proof, similar to Theorem 3.2, we split the kernel matrix into its diagonal and off-diagonal parts. The spectral norm of the off-diagonal part is again bounded by its Frobenius norm. We also verify the upper bounds presented in the following theorem by conducting numerical experiments (see Figure 1(b)).

Theorem 3.3.
Let $x_1, \dots, x_n \in \mathbb{R}^d$ be independent centered subgaussian vectors. Let $a > 0$, and let $K_g$ be the $n \times n$ matrix with $(i,j)$th entry $(K_g)_{ij} = \exp(-a\|x_i - x_j\|^2)$. Then there exist constants $c, c_1, c', c_2$ such that:

a) $\|K_g\| \leq n$.

b) If $a < c_1/d$, then $\Pr[\|K_g\| \geq c_2 n] \geq 1 - \exp(-c'n)$.

c) If all the vectors $x_1, \dots, x_n$ satisfy the additional assumption of having independent centered subgaussian coordinates,⁹ and $n \leq \exp(C_1 d)$ for a constant $C_1$, then for any $\delta > 0$ and $a \geq \frac{(2+\delta)\log n}{d}$,
$$\Pr[\|K_g\| \geq 2] \leq \exp(-c\zeta^2 d),$$
with $\zeta > 0$ depending only on $\delta$.

Proof. The proof of Part a) is straightforward: all entries of the symmetric matrix $K_g$ lie in $(0, 1]$, so $\|K_g\|$ is at most the maximum row sum, which does not exceed $n$.

Let us prove the lower estimate for the norm in Part b). For $i = 1, \dots, n/2$ define
$$Z_i = \sum_{j = n/2+1}^{n}(K_g)_{ij}.$$
From Lemma 3.1, for all $i \in [n]$, $\Pr[\|x_i\| \geq C\sqrt{d}] \leq \exp(-C'd)$. In other words, $\|x_i\|$ is less than $C\sqrt{d}$ for all $i \in [n]$ with probability at least $1 - n\exp(-C'd)$. Let us call this event $E$. Under $E$ and the assumption $a < c_1/d$, $\mathbb{E}[Z_i] \geq c_3 n$ and $\mathbb{E}[Z_i^2] \leq c_4 n^2$. Therefore, by the Paley-Zygmund inequality (under event $E$),
$$\Pr[Z_i \geq c_5 n] \geq c_6. \quad (5)$$
Now $Z_1, \dots, Z_{n/2}$ are not independent random variables. But if we condition on $x_{n/2+1}, \dots, x_n$, then $Z_1, \dots, Z_{n/2}$ become independent (for simplicity, assume that $n$ is divisible by $2$). Thereafter, an application of the Chernoff bound to $Z_1, \dots, Z_{n/2}$, using the probability bound from (5) (under conditioning on $x_{n/2+1}, \dots, x_n$ and event $E$), gives:
$$\Pr\big[Z_i \geq c_5 n \text{ for at least } c_7 n \text{ of the entries } Z_i \in \{Z_1, \dots, Z_{n/2}\}\big] \geq 1 - \exp(-c_8 n).$$

⁹Some of the commonly used subgaussian random vectors, such as the standard normal and Bernoulli, satisfy this additional assumption.
Hence, Pr[ (cid:107) K g (cid:107) ≥ ≤ Pr[ (cid:107) D (cid:107) + (cid:107) W (cid:107) F ≥
$$\Pr[\|K_g\| \geq 2] \leq \Pr[\|D\| + \|W\|_F \geq 2] = \Pr[\|W\|_F \geq 1] = \Pr[\|W\|_F^2 \geq 1]$$
$$= \Pr\bigg[\sum_{1 \leq i,j \leq n,\, i \neq j}\exp(-2a\|x_i - x_j\|^2) \geq 1\bigg]$$
$$\leq \Pr\bigg[\sum_{1 \leq i,j \leq n,\, i \neq j}\exp(-2a\|x_i - x_j\|^2) \geq n^2\exp(-4a(1-\zeta)d)\bigg]$$
$$\leq n^2\,\Pr\Big[\max_{1 \leq i,j \leq n,\, i \neq j}\exp(-a\|x_i - x_j\|^2) \geq \exp(-2a(1-\zeta)d)\Big]$$
$$\leq n^2\exp(-c_{10}\zeta^2 d) \leq \exp(-c\zeta^2 d) \quad \text{for some constant } c.$$
The first equality follows as $\|D\| = 1$, and the second-last inequality follows from (6). This completes the proof of the theorem. Again, the long chain of constants can easily be estimated starting with the constant in the definition of the subgaussian random variable.
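Numerically, with independent-coordinate data and $a \geq (2+\delta)\log n/d$, the matrix $K_g$ is indeed a small perturbation of the identity. A quick check (our sketch; the choice $a = 3\log(n)/d$ matches the Figure 1 setup):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 1000, 100
a = 3 * np.log(n) / d                    # satisfies a >= (2 + delta) log n / d with delta = 1
X = rng.standard_normal((n, d))          # independent centered subgaussian coordinates
sq = np.sum(X**2, 1)
Kg = np.exp(-a * np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
print(np.linalg.norm(Kg, 2))             # close to 1, below the bound of 2 in Theorem 3.3(c)
```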
Note that again the $x_i$'s need not be identically distributed. Also, as mentioned earlier, the analysis in Theorem 3.3 could easily be extended to other exponential kernels such as the Laplacian kernel.

¹⁰We call a random variable $x \in \mathbb{R}$ subexponential if there exists a constant $C > 0$ such that $\Pr[|x| > t] \leq 2\exp(-t/C)$ for all $t > 0$.

[Figure 1: panel (a) Polynomial Kernel and panel (b) Gaussian Kernel; each panel plots the largest eigenvalue (averaged over runs; log scale in (a)) against the kernel matrix size $n$, with curves for the actual value and for the upper bounds from Theorem 3.2 and Theorem 3.3 (Part c).]
Figure 1:
Largest eigenvalue distribution for random kernel matrices constructed with a polynomial kernel (left plot) and a Gaussian kernel (right plot). The actual value plots are constructed by averaging over 100 runs, and in each run we draw $n$ independent standard Gaussian vectors in $d = 100$ dimensions. The predicted values are computed from the bounds in Theorems 3.2 and 3.3 (Part c). The kernel matrix size $n$ is varied from to in multiples of . For the polynomial kernel, we set $a = 1$, $b = 1$, and $p = 4$, and for the Gaussian kernel $a = 3\log(n)/d$. Note that our upper bounds are fairly close to the actual results. For the Gaussian kernel, the actual values are very close to $1$.

4 Lower Bounds for Privately Releasing Kernel Ridge Regression

We consider an application of Theorems 3.2 and 3.3 to obtain noise lower bounds for privately releasing coefficients of kernel ridge regression. For privacy violation, we consider a generalization of blatant non-privacy [5] referred to as attribute non-privacy (formalized in [15]). Consider a database $D \in \mathbb{R}^{n \times (d+1)}$ that contains, for each individual $i$, a sensitive attribute $y_i \in \{0,1\}$ as well as some other information $x_i \in \mathbb{R}^d$ which is assumed to be known to the attacker. The $i$th record is thus $(x_i, y_i)$. Let $X \in \mathbb{R}^{n \times d}$ be a matrix whose $i$th row is $x_i$, and let $y = (y_1, \dots, y_n)$. We denote the entire database $D = (X|y)$, where $|$ represents vertical concatenation. Given some released information $\rho$, the attacker constructs an estimate $\hat{y}$ that she hopes is close to $y$. We measure the attack's success in terms of the Hamming distance $d_H(y, \hat{y})$. A scheme is not attribute private if an attacker can consistently get an estimate that is within distance $o(n)$. Formally:

Definition 5 (Failure of Attribute Privacy [15]). A (randomized) mechanism $M : \mathbb{R}^{n \times (d+1)} \to \mathbb{R}^l$ is said to allow $(\theta, \gamma)$ attribute reconstruction if there exists a setting of the nonsensitive attributes $X \in \mathbb{R}^{n \times d}$ and an algorithm (adversary) $A : \mathbb{R}^{n \times d} \times \mathbb{R}^l \to \mathbb{R}^n$ such that for every $y \in \{0,1\}^n$,
$$\Pr_{\rho \leftarrow M((X|y))}\big[A(X, \rho) = \hat{y} : d_H(y, \hat{y}) \leq \theta\big] \geq 1 - \gamma.$$
Asymptotically, we say that a mechanism is attribute nonprivate if there is an infinite sequence of $n$ for which $M$ allows $(o(n), o(1))$-reconstruction. Here $d = d(n)$ is a function of $n$. We say the attack $A$ is efficient if it runs in time $\mathrm{poly}(n, d)$.
Kernel Ridge Regression Background. One of the most basic regression formulations is that of ridge regression [10]. Suppose that we are given a dataset $\{(x_i, y_i)\}_{i=1}^n$ consisting of $n$ points with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Here the $x_i$'s are referred to as the regressors and the $y_i$'s are the response variables. In linear regression the task is to find a linear function that models the dependencies between the $x_i$'s and the $y_i$'s. A common way to prevent overfitting in linear regression is by adding a penalty regularization term (also known as shrinkage in statistics). In kernel ridge regression [21], we assume a model of the form $y = f(x) + \xi$, where we are trying to estimate the regression function $f$ and $\xi$ is some unknown noise term that accounts for the discrepancy between the actual response ($y$) and the predicted outcome ($f(x)$). Given a reproducing kernel Hilbert space $\mathcal{H}$ with kernel $\kappa$, the goal of kernel ridge regression is to estimate the unknown function $f^\star$ such that the least-squares loss defined over the dataset, with a weighted penalty based on the squared Hilbert norm, is minimized:
$$\text{Kernel Ridge Regression:} \quad \operatorname*{argmin}_{f \in \mathcal{H}}\bigg(\frac{1}{n}\sum_{i=1}^n(y_i - f(x_i))^2 + \lambda\|f\|_{\mathcal{H}}^2\bigg), \quad (7)$$
where $\lambda > 0$ is a regularization parameter. By the representer theorem [22], any solution $f^\star$ for (7) takes the form
$$f^\star(\cdot) = \sum_{i=1}^n \alpha_i\kappa(\cdot, x_i), \quad (8)$$
where $\alpha = (\alpha_1, \dots, \alpha_n)$ is known as the kernel ridge regression coefficient vector. Plugging this representation into (7) and solving the resulting optimization problem (in terms of $\alpha$ now), we get that the minimum value is achieved for $\alpha = \alpha^\star$, where
$$\alpha^\star = (K + \lambda I_n)^{-1}y, \quad \text{where } K \text{ is the kernel matrix with } K_{ij} = \kappa(x_i, x_j) \text{ and } y = (y_1, \dots, y_n). \quad (9)$$
Plugging this $\alpha^\star$ from (9) into (8) gives the final form of the estimate $f^\star(\cdot)$. This means that for a new point $x \in \mathbb{R}^d$, the predicted response is $f^\star(x) = \sum_{i=1}^n\alpha_i^\star\kappa(x, x_i)$, where $\alpha^\star = (K + \lambda I_n)^{-1}y$ and $\alpha^\star = (\alpha_1^\star, \dots, \alpha_n^\star)$. Therefore, knowledge of $\alpha^\star$ and $x_1, \dots, x_n$ suffices for using the regression model for making future predictions. If $K$ is constructed using a polynomial kernel (defined in (1)), then the above procedure is referred to as polynomial kernel ridge regression; similarly, if $K$ is constructed using a Gaussian kernel (defined in (2)), then the above procedure is referred to as Gaussian kernel ridge regression.

Reconstruction Attack from Noisy $\alpha^\star$. Algorithm 1 outlines the attack. The privacy mechanism releases a noisy approximation to $\alpha^\star$. Let $\tilde{\alpha}$ be this noisy approximation, i.e., $\tilde{\alpha} = \alpha^\star + e$, where $e$ is some unknown noise vector. The adversary tries to reconstruct an approximation $\hat{y}$ of $y$ from $\tilde{\alpha}$. The adversary solves the following $\ell_2$-minimization problem to construct $\hat{y}$:
$$\min_{z \in \mathbb{R}^n}\|\tilde{\alpha} - (K + \lambda I_n)^{-1}z\|. \quad (10)$$
In the setting of attribute privacy, the database is $D = (X|y)$. Let $x_1, \dots, x_n$ be the rows of $X$, using which the adversary can construct $K$ to carry out the attack. Since the matrix $K + \lambda I_n$ is invertible for $\lambda > 0$ (as $K$ is a positive semidefinite matrix), the solution to (10) is simply $z = (K + \lambda I_n)\tilde{\alpha}$, element-wise rounding of which to the closest value in $\{0, 1\}$ gives $\hat{y}$.
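To make (9) and the object the attack works with concrete, here is a short sketch (ours; the Gaussian kernel, data, and parameters are illustrative assumptions) that computes $\alpha^\star = (K + \lambda I_n)^{-1}y$ and uses it for prediction:

```python
import numpy as np

def gaussian_kernel(X1, X2, a=0.5):
    """kappa(x, x') = exp(-a ||x - x'||^2), evaluated for all row pairs of X1, X2."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-a * np.maximum(sq, 0.0))

rng = np.random.default_rng(3)
n, d, lam = 200, 10, 0.1
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, n).astype(float)              # binary responses, as in the privacy setting

K = gaussian_kernel(X, X)
alpha_star = np.linalg.solve(K + lam * np.eye(n), y)  # eq. (9): alpha* = (K + lambda I_n)^(-1) y

x_new = rng.standard_normal((1, d))
y_pred = gaussian_kernel(x_new, X) @ alpha_star       # f*(x) = sum_i alpha*_i kappa(x, x_i)
print(y_pred)
```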
Lemma 4.1. Let $\tilde{\alpha} = \alpha^\star + e$, where $e \in \mathbb{R}^n$ is some unknown (noise) vector. If $\|e\|_\infty \leq \beta$ (i.e., the absolute value of every entry in $e$ is at most $\beta$), then the $\hat{y}$ returned by Algorithm 1 satisfies
$$d_H(y, \hat{y}) \leq 4(\|K\| + \lambda)^2\beta^2 n.$$
In particular, if $\beta = o\big(\frac{1}{\|K\| + \lambda}\big)$, then $d_H(y, \hat{y}) = o(n)$.

Proof. Since $\alpha^\star = (K + \lambda I_n)^{-1}y$, $\tilde{\alpha} = (K + \lambda I_n)^{-1}y + e$. Now multiplying both sides by $(K + \lambda I_n)$ gives
$$(K + \lambda I_n)\tilde{\alpha} = y + (K + \lambda I_n)e.$$
Input:
Public information X ∈ R n × d , regularization parameter λ , and ˜ α (noisy version of α (cid:63) defined in (9)). Let x , . . . , x n be the rows of X , construct the kernel matrix K with K ij = κ ( x i , x j ) R eturn ˆ y = (ˆ y , . . . , ˆ y n ) defined as follows: ˆ y i = (cid:26) if i th entry in ( K + λ I n ) ˜ α < / otherwiseConcentrate on (cid:107) ( K + λ I n ) e (cid:107) . This can be bound as (cid:107) ( K + λ I n ) e (cid:107) ≤ (cid:107) ( K + λ I n ) (cid:107)(cid:107) e (cid:107) = ( (cid:107) K (cid:107) + λ ) (cid:107) e (cid:107) . If the absolute value of all the entries in e are less than β then (cid:107) e (cid:107) ≤ β √ n . A simple manipulation then showsthat if the above hold then ( K + λ I n ) e cannot have more than (cid:107) K (cid:107) + λ ) β n entries with absolute valueabove / . Since ˆ y and y only differ in those entries where ( K + λ I n ) e is greater than / , it follows that d H ( y , ˆ y ) ≤ (cid:107) K (cid:107) + λ ) β n . Setting β = o ( (cid:107) K (cid:107) + λ ) implies d H ( y , ˆ y ) = o ( n ) .For a privacy mechanism to be attribute non-private, the adversary has to be able reconstruct an − o (1) fraction of y with high probability. Using the above lemma, and the different bounds on (cid:107) K (cid:107) established in The-orems 3.2 and 3.3, we get the following lower bounds for privately releasing kernel ridge regression coefficients. Proposition 4.2.
For a privacy mechanism to be attribute non-private, the adversary has to be able to reconstruct a $1 - o(1)$ fraction of $y$ with high probability. Using the above lemma, and the different bounds on $\|K\|$ established in Theorems 3.2 and 3.3, we get the following lower bounds for privately releasing kernel ridge regression coefficients.

Proposition 4.2. 1) Any privacy mechanism which, for every database $D = (X|y)$ where $X \in \mathbb{R}^{n \times d}$ and $y \in \{0,1\}^n$, releases the coefficient vector of a polynomial kernel ridge regression model (for constants $a$, $b$, and $p$) fitted between $X$ (matrix of regressor values) and $y$ (response vector) by adding $o\big(\frac{1}{d^p n + \lambda}\big)$ noise to each coordinate is attribute non-private. The attack that achieves this attribute privacy violation operates in $O(dn^2)$ time.

2) Any privacy mechanism which, for every database $D = (X|y)$ where $X \in \mathbb{R}^{n \times d}$ and $y \in \{0,1\}^n$, releases the coefficient vector of a Gaussian kernel ridge regression model (for constant $a$) fitted between $X$ (matrix of regressor values) and $y$ (response vector) by adding $o\big(\frac{1}{1 + \lambda}\big)$ noise to each coordinate is attribute non-private. The attack that achieves this attribute privacy violation operates in $O(dn^2)$ time.

Proof. For Part 1, draw each individual $i$'s non-sensitive attribute vector $x_i$ independently from any $d$-dimensional subgaussian distribution, and use Lemma 4.1 in conjunction with Theorem 3.2.

For Part 2, draw each individual $i$'s non-sensitive attribute vector $x_i$ independently from any product distribution formed from some $d$ centered univariate subgaussian distributions,¹¹ and use Lemma 4.1 in conjunction with Theorem 3.3 (Part c).

The time needed to construct the kernel matrix $K$ is $O(dn^2)$, which dominates the overall computation time.

We can ask how the above distortion needed for privacy compares to typical entries in $\alpha^\star$. The answer is not simple, but there are natural settings of inputs where the noise needed for privacy becomes comparable with the coordinates of $\alpha^\star$, implying that privacy comes at a steep price. One such example: if the $x_i$'s are drawn as in the proof above, $y = (1)^n$, and all the other kernel parameters are constant, then the expected value of the corresponding $\alpha^\star$ coordinates matches the noise bounds obtained in Proposition 4.2.

Note that Proposition 4.2 makes no assumptions on the dimension $d$ of the data, and holds for all values of $n, d$. This is different from all other previous lower bounds for attribute privacy [15, 4, 14], all of which require $d$ to be comparable to $n$, thereby holding only either when the non-sensitive data (the $x_i$'s) are very high-dimensional or for very small $n$. Also, all the previous lower bound analyses [15, 4, 14] critically rely on the fact that the individual coordinates of each of the $x_i$'s are independent,¹² which is not essential for Proposition 4.2.

¹¹Note that it is not critical for the $x_i$'s to be drawn from a product distribution. It is possible to analyze the attack even under the (weaker) assumption that each individual $i$'s non-sensitive attribute vector $x_i$ is drawn independently from a $d$-dimensional subgaussian distribution, by using Lemma 4.1 in conjunction with Theorem 3.3 (Part a).
¹²This may not be a realistic assumption in many practical scenarios. For example, an individual's salary and postal code are correlated and not independent.

Note on $\ell_1$-Reconstruction Attacks. A natural alternative to (10) is to use $\ell_1$-minimization (also known as "LP decoding"). This gives rise to the following linear program:
$$\min_{z \in \mathbb{R}^n}\|\tilde{\alpha} - (K + \lambda I_n)^{-1}z\|_1. \quad (11)$$
In the context of privacy, the $\ell_1$-minimization approach was first proposed by Dwork et al. [8], and recently reanalyzed in different contexts by [4, 14]. These results have shown that, for some settings, $\ell_1$-minimization can handle considerably more complex noise patterns than $\ell_2$-minimization.
However, in our setting, since the solutions of (11) and (10) are exactly the same ($z = (K + \lambda I_n)\tilde{\alpha}$), there is no inherent advantage to using $\ell_1$-minimization.

Acknowledgements
We are grateful for helpful initial discussions with Adam Smith and Ambuj Tewari.
References

[1] Bousquet, O., von Luxburg, U., and Rätsch, G. Advanced Lectures on Machine Learning. In ML Summer Schools 2003 (2004).

[2] Cheng, X., and Singer, A. The Spectrum of Random Inner-Product Kernel Matrices. Random Matrices: Theory and Applications 2, 04 (2013).

[3] Choromanski, K., and Malkin, T. The Power of the Dinur-Nissim Algorithm: Breaking Privacy of Statistical and Graph Databases. In PODS (2012), ACM, pp. 65-76.

[4] De, A. Lower Bounds in Differential Privacy. In TCC (2012), pp. 321-338.

[5] Dinur, I., and Nissim, K. Revealing Information while Preserving Privacy. In PODS (2003), ACM, pp. 202-210.

[6] Do, Y., and Vu, V. The Spectrum of Random Kernel Matrices: Universality Results for Rough and Varying Kernels. Random Matrices: Theory and Applications 2, 03 (2013).

[7] Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating Noise to Sensitivity in Private Data Analysis. In TCC (2006), vol. 3876 of LNCS, Springer, pp. 265-284.

[8] Dwork, C., McSherry, F., and Talwar, K. The Price of Privacy and the Limits of LP Decoding. In STOC (2007), ACM, pp. 85-94.

[9] Dwork, C., and Yekhanin, S. New Efficient Attacks on Statistical Disclosure Control Mechanisms. In CRYPTO (2008), Springer, pp. 469-480.

[10] Hoerl, A. E., and Kennard, R. W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 1 (1970), 55-67.

[11] Jain, P., and Thakurta, A. Differentially Private Learning with Kernels. In ICML (2013), pp. 118-126.

[12] Jia, L., and Liao, S. Accurate Probabilistic Error Bound for Eigenvalues of Kernel Matrix. In Advances in Machine Learning. Springer, 2009, pp. 162-175.

[13] Karoui, N. E. The Spectrum of Kernel Random Matrices. The Annals of Statistics (2010), 1-50.

[14] Kasiviswanathan, S. P., Rudelson, M., and Smith, A. The Power of Linear Reconstruction Attacks. In SODA (2013), pp. 1415-1433.

[15] Kasiviswanathan, S. P., Rudelson, M., Smith, A., and Ullman, J. The Price of Privately Releasing Contingency Tables and the Spectra of Random Matrices with Correlated Rows. In STOC (2010), pp. 775-784.

[16] Lanckriet, G. R., Cristianini, N., Bartlett, P., Ghaoui, L. E., and Jordan, M. I. Learning the Kernel Matrix with Semidefinite Programming. The Journal of Machine Learning Research 5 (2004), 27-72.

[17] Mercer, J. Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character (1909), 415-446.

[18] Merener, M. M. Polynomial-time Attack on Output Perturbation Sanitizers for Real-valued Databases. Journal of Privacy and Confidentiality 2, 2 (2011), 5.

[19] Muthukrishnan, S., and Nikolov, A. Optimal Private Halfspace Counting via Discrepancy. In STOC (2012), pp. 1285-1292.

[20] Rudelson, M. Recent Developments in Non-asymptotic Theory of Random Matrices. Modern Aspects of Random Matrix Theory 72 (2014), 83.

[21] Saunders, C., Gammerman, A., and Vovk, V. Ridge Regression Learning Algorithm in Dual Variables. In ICML (1998), pp. 515-521.

[22] Schölkopf, B., Herbrich, R., and Smola, A. J. A Generalized Representer Theorem. In COLT (2001), pp. 416-426.

[23] Schölkopf, B., and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[24] Shawe-Taylor, J., and Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[25] Shawe-Taylor, J., Williams, C. K., Cristianini, N., and Kandola, J. On the Eigenspectrum of the Gram Matrix and the Generalization Error of Kernel-PCA. IEEE Transactions on Information Theory 51, 7 (2005), 2510-2522.

[26] Sindhwani, V., Quang, M. H., and Lozano, A. C. Scalable Matrix-valued Kernel Learning for High-dimensional Nonlinear Multivariate Regression and Granger Causality. arXiv preprint arXiv:1210.4792 (2012).

[27] Vershynin, R. Introduction to the Non-asymptotic Analysis of Random Matrices. arXiv preprint arXiv:1011.3027 (2010).