Frank McSherry | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Frank McSherry is active.

Explore More

Publication

Featured researches published by Frank McSherry.

foundations of computer science | 2007

Mechanism Design via Differential Privacy

Frank McSherry; Kunal Talwar

We study the role that privacy-preserving algorithms, which prevent the leakage of specific information about participants, can play in the design of mechanisms for strategic agents, which must encourage players to honestly report information. Specifically, we show that the recent notion of differential privacv, in addition to its own intrinsic virtue, can ensure that participants have limited effect on the outcome of the mechanism, and as a consequence have limited incentive to lie. More precisely, mechanisms with differential privacy are approximate dominant strategy under arbitrary player utility functions, are automatically resilient to coalitions, and easily allow repeatability. We study several special cases of the unlimited supply auction problem, providing new results for digital goods auctions, attribute auctions, and auctions with arbitrary structural constraints on the prices. As an important prelude to developing a privacy-preserving auction mechanism, we introduce and study a generalization of previous privacy work that accommodates the high sensitivity of the auction setting, where a single participant may dramatically alter the optimal fixed price, and a slight change in the offered price may take the revenue from optimal to zero.

symposium on principles of database systems | 2005

Practical privacy: the SuLQ framework

Avrim Blum; Cynthia Dwork; Frank McSherry; Kobbi Nissim

We consider a statistical database in which a trusted administrator introduces noise to the query responses with the goal of maintaining privacy of individual database entries. In such a database, a query consists of a pair (S, f) where S is a set of rows in the database and f is a function mapping database rows to {0, 1}. The true answer is ΣiεS f(di), and a noisy version is released as the response to the query. Results of Dinur, Dwork, and Nissim show that a strong form of privacy can be maintained using a surprisingly small amount of noise -- much less than the sampling error -- provided the total number of queries is sublinear in the number of database rows. We call this query and (slightly) noisy reply the SuLQ (Sub-Linear Queries) primitive. The assumption of sublinearity becomes reasonable as databases grow increasingly large.We extend this work in two ways. First, we modify the privacy analysis to real-valued functions f and arbitrary row types, as a consequence greatly improving the bounds on noise required for privacy. Second, we examine the computational power of the SuLQ primitive. We show that it is very powerful indeed, in that slightly noisy versions of the following computations can be carried out with very few invocations of the primitive: principal component analysis, k means clustering, the Perceptron Algorithm, the ID3 algorithm, and (apparently!) all algorithms that operate in the in the statistical query learning model [11].

knowledge discovery and data mining | 2009

Differentially private recommender systems: Building privacy into the Netflix Prize contenders

Frank McSherry; Ilya Mironov

We consider the problem of producing recommendations from collective user behavior while simultaneously providing guarantees of privacy for these users. Specifically, we consider the Netflix Prize data set, and its leading algorithms, adapted to the framework of differential privacy. Unlike prior privacy work concerned with cryptographically securing the computation of recommendations, differential privacy constrains a computation in a way that precludes any inference about the underlying records from its output. Such algorithms necessarily introduce uncertainty--i.e., noise--to computations, trading accuracy for privacy. We find that several of the leading approaches in the Netflix Prize competition can be adapted to provide differential privacy, without significantly degrading their accuracy. To adapt these algorithms, we explicitly factor them into two parts, an aggregation/learning phase that can be performed with differential privacy guarantees, and an individual recommendation phase that uses the learned correlations and an individuals data to provide personalized recommendations. The adaptations are non-trivial, and involve both careful analysis of the per-record sensitivity of the algorithms to calibrate noise, as well as new post-processing steps to mitigate the impact of this noise. We measure the empirical trade-off between accuracy and privacy in these adaptations, and find that we can provide non-trivial formal privacy guarantees while still outperforming the Cinematch baseline Netflix provides.

Journal of the ACM | 2007

Fast computation of low-rank matrix approximations

Dimitris Achlioptas; Frank McSherry

Given a matrix A, it is often desirable to find a good approximation to A that has low rank. We introduce a simple technique for accelerating the computation of such approximations when A has strong spectral features, that is, when the singular values of interest are significantly greater than those of a random matrix with size and entries similar to A. Our technique amounts to independently sampling and/or quantizing the entries of A, thus speeding up computation by reducing the number of nonzero entries and/or the length of their representation. Our analysis is based on observing that the acts of sampling and quantization can be viewed as adding a random matrix N to A, whose entries are independent random variables with zero-mean and bounded variance. Since, with high probability, N has very weak spectral features, we can prove that the effect of sampling and quantization nearly vanishes when a low-rank approximation to A + N is computed. We give high probability bounds on the quality of our approximation both in the Frobenius and the 2-norm.

symposium on operating systems principles | 2013

Naiad: a timely dataflow system

Derek Gordon Murray; Frank McSherry; Rebecca Isaacs; Michael Isard; Paul Barham; Martín Abadi

Naiad is a distributed system for executing data parallel, cyclic dataflow programs. It offers the high throughput of batch processors, the low latency of stream processors, and the ability to perform iterative and incremental computations. Although existing systems offer some of these features, applications that require all three have relied on multiple platforms, at the expense of efficiency, maintainability, and simplicity. Naiad resolves the complexities of combining these features in one framework. A new computational model, timely dataflow, underlies Naiad and captures opportunities for parallelism across a wide class of algorithms. This model enriches dataflow computation with timestamps that represent logical points in the computation and provide the basis for an efficient, lightweight coordination mechanism. We show that many powerful high-level programming models can be built on Naiads low-level primitives, enabling such diverse tasks as streaming data analysis, iterative machine learning, and interactive graph mining. Naiad outperforms specialized systems in their target application domains, and its unique features enable the development of new high-performance applications.

theory and application of cryptographic techniques | 2006

Our data, ourselves: privacy via distributed noise generation

Cynthia Dwork; Krishnaram Kenthapadi; Frank McSherry; Ilya Mironov; Moni Naor

In this work we provide efficient distributed protocols for generating shares of random noise, secure against malicious participants. The purpose of the noise generation is to create a distributed implementation of the privacy-preserving statistical databases described in recent papers [14,4,13]. In these databases, privacy is obtained by perturbing the true answer to a database query by the addition of a small amount of Gaussian or exponentially distributed random noise. The computational power of even a simple form of these databases, when the query is just of the form ∑if(di), that is, the sum over all rows i in the database of a function f applied to the data in row i, has been demonstrated in [4]. A distributed implementation eliminates the need for a trusted database administrator. The results for noise generation are of independent interest. The generation of Gaussian noise introduces a technique for distributing shares of many unbiased coins with fewer executions of verifiable secret sharing than would be needed using previous approaches (reduced by a factor of n). The generation of exponentially distributed noise uses two shallow circuits: one for generating many arbitrarily but identically biased coins at an amortized cost of two unbiased random bits apiece, independent of the bias, and the other to combine bits of appropriate biases to obtain an exponential distribution.

theory of cryptography conference | 2005

Toward privacy in public databases

Shuchi Chawla; Cynthia Dwork; Frank McSherry; Adam D. Smith; Hoeteck Wee

We initiate a theoretical study of the census problem. Informally, in a census individual respondents give private information to a trusted party (the census bureau), who publishes a sanitized version of the data. There are two fundamentally conflicting requirements: privacy for the respondents and utility of the sanitized data. Unlike in the study of secure function evaluation, in which privacy is preserved to the extent possible given a specific functionality goal, in the census problem privacy is paramount; intuitively, things that cannot be learned “safely” should not be learned at all. An important contribution of this work is a definition of privacy (and privacy compromise) for statistical databases, together with a method for describing and comparing the privacy offered by specific sanitization techniques. We obtain several privacy results using two different sanitization techniques, and then show how to combine them via cross training. We also obtain two utility results involving clustering.

symposium on the theory of computing | 2007

The price of privacy and the limits of LP decoding

Cynthia Dwork; Frank McSherry; Kunal Talwar

This work is at theintersection of two lines of research. One line, initiated by Dinurand Nissim, investigates the price, in accuracy, of protecting privacy in a statistical database. The second, growing from an extensive literature on compressed sensing (see in particular the work of Donoho and collaborators [4,7,13,11])and explicitly connected to error-correcting codes by Candès and Tao ([4]; see also [5,3]), is in the use of linearprogramming for error correction. Our principal result is the discovery of a sharp threshhold ρ*∠ 0.239, so that if ρ < ρ* and A is a random m x n encoding matrix of independently chosen standardGaussians, where m = O(n), then with overwhelming probability overchoice of A, for all x ∈ Rn, LP decoding corrects ⌊ ρ m⌋ arbitrary errors in the encoding Ax, while decoding can be made to fail if the error rate exceeds ρ*. Our boundresolves an open question of Candès, Rudelson, Tao, and Vershyin [3] and (oddly, but explicably) refutesempirical conclusions of Donoho [11] and Candès et al [3]. By scaling and rounding we can easilytransform these results to obtain polynomial-time decodable random linear codes with polynomial-sized alphabets tolerating any ρ < ρ* ∠ 0.239 fraction of arbitrary errors. In the context of privacy-preserving datamining our results say thatany privacy mechanism, interactive or non-interactive, providingreasonably accurate answers to a 0.761 fraction of randomly generated weighted subset sum queries, and arbitrary answers on the remaining 0.239 fraction, is blatantly non-private.

conference on learning theory | 2005

On spectral learning of mixtures of distributions

Dimitris Achlioptas; Frank McSherry

We consider the problem of learning mixtures of distributions via spectral methods and derive a characterization of when such methods are useful. Specifically, given a mixture-sample, let μ i , C i , w i denote the empirical mean, covariance matrix, and mixing weight of the samples from the i-th component. We prove that a very simple algorithm, namely spectral projection followed by single-linkage clustering, properly classifies every point in the sample provided that each pair of means μ i , μ j is well separated, in the sense that ∥μ i - μ j ∥ 2 is at least ∥C i ∥ 2 (1/w i +1/w j ) plus a term that depends on the concentration properties of the distributions in the mixture. This second term is very small for many distributions, including Gaussians, Log-concave, and many others. As a result, we get the best known bounds for learning mixtures of arbitrary Gaussians in terms of the required mean separation. At the same time, we prove that there are many Gaussian mixtures {(μ i , C i , w i )} such that each pair of means is separated by ∥C i ∥ 2 (1/w i + 1/w j ), yet upon spectral projection the mixture collapses completely, i.e., all means and covariance matrices in the projected mixture are identical.

international workshop on peer to peer systems | 2005

A first look at peer-to-peer worms: threats and defenses

Lidong Zhou; Lintao Zhang; Frank McSherry; Nicole Immorlica; Manuel Costa; Steve Chien

Peer-to-peer (P2P) worms exploit common vulnerabilities in member hosts of a P2P network and spread topologically in the P2P network, a potentially more effective strategy than random scanning for locating victims. This paper describes the danger posed by P2P worms and initiates the study of possible mitigation mechanisms. In particular, the paper explores the feasibility of a self-defense infrastructure inside a P2P network, outlines the challenges, evaluates how well this defense mechanism contains P2P worms, and reveals correlations between containment and the overlay topology of a P2P network. Our experiments suggest a number of design directions to improve the resilience of P2P networks to worm attacks.

Explore More