The problem of statistical disclosure control—revealing accurate statistics about a population while preserving the privacy of individuals—has a venerable history. An extensive literature spans multiple disciplines: statistics, theoretical computer science, security, and databases. Nevertheless, despite this extensive literature, «privacy breaches» are common, both in the literature and in practice, even when security and data integrity are not compromised.
This project revisits private data analysis from the perspective of modern cryptography. We address many previous difficulties by obtaining a strong, yet realizable, definition of privacy. Intuitively, differential privacy ensures that the system behaves the essentially same way, independent of whether any individual, or small group of individuals, opts in to or opts out of the database. More precisely, for every possible output of the system, the probability of this output is almost unchanged by the addition or removal of any individual, where the probabilities are taken over the coin flips of the mechanism (and not the data set). Moreover, this holds even in the face of arbitrary existing or future knowledge available to a «privacy adversary,» completely solving the problem of database linkage attacks.
Databases can serve many social goals, such as fair allocation of resources, and identifying genetic markers for disease. Better participation means better information, and the «in vs out» aspect of differential privacy encourages participation.
For a general overview of differential privacy—the problems to be solved, the definition, the formal impossibility results that lead to the definition, general techniques for achieving differential privacy, and some recent directions, see «A firm foundation for private data analysis» (to appear in Communications of ACM).
For selected publications organized by topic and chronological ordered scroll down.
Privacy Integrated Queries (PINQ)
This page lists in the reverse-chronological order all publications related to the project on database privacy organized in three categories:
We study differential privacy in a distributed setting where two parties would like to perform analysis of their joint data while preserving privacy for both datasets. Our results imply almost tight lower bounds on the accuracy of such data analyses, both for specific natural functions (such as Hamming distance) and in general. Our bounds expose a sharp contrast between the two-party setting and the simpler client-server setting (where privacy guarantees are one-sided). In addition, those bounds demonstrate a dramatic gap between the accuracy that can be obtained by differentially private data analysis versus the accuracy obtainable when privacy is relaxed to a computational variant of differential privacy.
The first proof technique we develop demonstrates a connection between differential privacy and deterministic extraction from Santha-Vazirani sources. A second connection we expose indicates that the ability to approximate a function by a low-error differentially-private protocol is strongly related to the ability to approximate it by a low communication protocol. (The connection goes in both directions.)
We consider the noise complexity of differentially private mechanisms in the setting where the user asks d linear queries fcolon Rn to Re non-adaptively. Here, the database is represented by a vector in Rn and proximity between databases is measured in the ℓ_{1}-metric. We show that the noise complexity is determined by two geometric parameters associated with the set of queries. We use this connection to give tight upper and lower bounds on the noise complexity for any d ≤ n. We show that for d random linear queries of sensitivity 1, it is necessary and sufficient to add ℓ_{2}-error Θ(min {d√d/ε,d√log (n/d)/ε}) to achieve ε-differential privacy. Assuming the truth of a deep conjecture from convex geometry, known as the Hyperplane conjecture, we can extend our results to arbitrary linear queries giving nearly matching upper and lower bounds. Our bound translates to error O(min {d/ε,√dlog (n/d)/ε}) per answer. The best previous upper bound (Laplacian mechanism) gives a bound of O(min {d/ε,√n/ε}) per answer, while the best known lower bound was Ω(√d/ε). In contrast, our lower bound is strong enough to separate the concept of differential privacy from the notion of approximate differential privacy where an upper bound of O(√d/ε) can be achieved.
Differential privacy is a recent notion of privacy tailored to privacy-preserving data analysis [11]. Up to this point, research on differentially private data analysis has focused on the setting of a trusted curator holding a large, static, data set; thus every computation is a “one-shot” object: there is no point in computing something twice, since the result will be unchanged, up to any randomness introduced for privacy.
However, many applications of data analysis involve repeated computations, either because the entire goal is one of monitoring, e.g., of traffic conditions, search trends, or incidence of influenza, or because the goal is some kind of adaptive optimization, e.g., placement of data to minimize access costs. In these cases, the algorithm must permit continual observation of the system’s state. We therefore initiate a study of differential privacy under continual observation. We identify the problem of maintaining a counter in a privacy preserving manner and show its wide applicability to many different problems.
Collectors of confidential data, such as governmental agencies, hospitals, or search engine providers, can be pressured to permit data to be used for purposes other than that for which they were collected. To support the data curators, we initiate a study of pan-private algorithms; roughly speaking, these algorithms retain their privacy properties even if their internal state becomes visible to an adversary. Our principal focus is on streaming algorithms, where each datum may be discarded immediately after processing.
Consider the following problem: given a metric space, some of whose points are “clients,” select a set of at most k facility locations to minimize the average distance from the clients to their nearest facility. This is just the well-studied k-median problem, for which many approximation algorithms and hardness results are known. Note that the objective function encourages opening facilities in areas where there are many clients, and given a solution, it is often possible to get a good idea of where the clients are located. This raises the following quandary: what if the locations of the clients are sensitive information that we would like to keep private? Is it even possible to design good algorithms for this problem that preserve the privacy of the clients?
In this paper, we initiate a systematic study of algorithms for discrete optimization problems in the framework of differential privacy (which formalizes the idea of protecting the privacy of individual input elements). We show that many such problems indeed have good approximation algorithms that preserve differential privacy; this is even in cases where it is impossible to preserve cryptographic definitions of privacy while computing any non-trivial approximation to even the value of an optimal solution, let alone the entire solution.
Apart from the k-median problem, we consider the problems of vertex and set cover, min-cut, k-median, facility location, and Steiner tree, and give approximation algorithms and lower bounds for these problems. We also consider the recently introduced submodular maximization problem, “Combinatorial Public Projects” (CPP), shown by Papadimitriou et al. to be inapproximable to subpolynomial multiplicative factors by any efficient and truthful algorithm. We give a differentially private (and hence approximately truthful) algorithm that achieves a logarithmic additive approximation.
In 1977 Tore Dalenius articulated a desideratum for statistical databases: nothing about an individual should be learnable from the database that cannot be learned without access to the database. We give a general impossibility result showing that a natural formalization of Dalenius’ goal cannot be achieved if the database is useful. The key obstacle is the side information that may be available to an adversary. Our results hold under very general conditions regarding the database, the notion of privacy violation, and the notion of utility.
Contrary to intuition, a variant of the result threatens the privacy even of someone not in the database. This state of affairs motivated the notion of differential privacy [15, 16], a strong ad omnia privacy which, intuitively, captures the increased risk to one’s privacy incurred by participating in a database.
The definition of differential privacy has recently emerged as a leading standard of privacy guarantees for algorithms on statistical databases. We offer several relaxations of the definition which require privacy guarantees to hold only against efficient—i.e., computationally-bounded—adversaries. We establish various relationships among these notions, and in doing so, we observe their close connection with the theory of pseudodense sets by Reingold et al. We extend the dense model theorem of Reingold et al. to demonstrate equivalence between two definitions (indistinguishability- and simulatability-based) of computational differential privacy.
Our computational analogues of differential privacy seem to allow for more accurate constructions than the standard information-theoretic analogues. In particular, in the context of private approximation of the distance between two vectors, we present a differentially-private protocol for computing the approximation, and contrast it with a substantially more accurate protocol that is only computationally differentially private.
We consider the problem of producing recommendations from collective user behavior while simultaneously providing guarantees of privacy for these users. Specifically, we consider the Netflix Prize data set, and its leading algorithms, adapted to the framework of differential privacy.
Unlike prior privacy work concerned with cryptographically securing the computation of recommendations, differential privacy constrains a computation in a way that precludes any inference about the underlying records from its output. Such algorithms necessarily introduce uncertainty—i.e., noise—to computations, trading accuracy for privacy.
We find that several of the leading approaches in the Netflix Prize competition can be adapted to provide differential privacy, without significantly degrading their accuracy. To adapt these algorithms, we explicitly factor them into two parts, an aggregation/learning phase that can be performed with differential privacy guarantees, and an individual recommendation phase that uses the learned correlations and an individual’s data to provide personalized recommendations.
The adaptations are non-trivial, and involve both careful analysis of the per-record sensitivity of the algorithms to calibrate noise, as well as new post-processing steps to mitigate the impact of this noise.
We measure the empirical trade-off between accuracy and privacy in these adaptations, and find that we can provide non-trivial formal privacy guarantees while still outperforming the Cinematch baseline Netflix provides.
We report on the design and implementation of the Privacy Integrated Queries (PINQ) platform for privacy-preserving data analysis. PINQ provides analysts with a programming interface to unscrubbed data through a SQL-like language. At the same time, the design of PINQ’s analysis language and its careful implementation provide formal guarantees of differential privacy for any and all uses of the platform. PINQ’s unconditional structural guarantees require no trust placed in the expertise or diligence of the analysts, substantially broadening the scope for design and deployment of privacy-preserving data analysis, especially by non-experts.
We show by means of several examples that robust statistical estimators present an excellent starting point for differentially private estimators. Our algorithms use a new paradigm for differentially private mechanisms, which we call Propose-Test-Release (PTR), and for which we give a formal definition and general composition theorems.
We consider private data analysis in the setting in which a trusted and trustworthy curator, having obtained a large data set containing private information, releases to the public a “sanitization” of the data set that simultaneously protects the privacy of the individual contributors of data and offers utility to the data analyst. The sanitization may be in the form of an arbitrary data structure, accompanied by a computational procedure for determining approximate answers to queries on the original data set, or it may be a “synthetic data set” consisting of data items drawn from the same universe as items in the original data set; queries are carried out as if the synthetic data set were the actual input. In either case the process is non-interactive; once the sanitization has been released the original data and the curator play no further role.
For the task of sanitizing with a synthetic dataset output, we map the boundary between computational feasibility and infeasibility with respect to a variety of utility measures. For the (potentially easier) task of sanitizing with unrestricted output format, we show a tight qualitative and quantitative connection between hardness of sanitizing and the existence of traitor tracing schemes.
The goal of a statistical database is to provide statistics about a population while simultaneously protecting the privacy of the individual records in the database. The tension between privacy and usability of statistical databases has attracted much attention in statistics, theoretical computer science, security, and database communities in recent years. A line of research initiated by Dinur and Nissim investigates for a particular type of queries, lower bounds on the distortion needed in order to prevent gross violations of privacy. The first result in the current paper simplifies and sharpens the Dinur and Nissim result.
The Dinur-Nissim style results are strong because they demonstrate insecurity of all low-distortion privacy mechanisms. The attacks have an all-or-nothing flavor: letting n denote the size of the database, Ω(n) queries are made before anything is learned, at which point Θ(n) secret bits are revealed. Restricting attention to a wide and realistic subset of possible low-distortion mechanisms, our second result is a more acute attack, requiring only a fixed number of queries for each bit revealed.
Consider a pollster who wishes to collect private, sensitive data from a number of distrustful individuals. How might the pollster convince the respondents that it is trustworthy? Alternately, what mechanism could the respondents insist upon to ensure that mismanagement of their data is detectable and publicly demonstrable?
We detail this problem, and provide simple data submission protocols with the properties that a) leakage of private data by the pollster results in evidence of the transgression and b) the evidence cannot be fabricated without breaking cryptographic assumptions. With such guarantees, a responsible pollster could post a “privacy-bond”, forfeited to anyone who can provide evidence of leakage. The respondents are assured that appropriate penalties are applied to a leaky pollster, while the protection from spurious indictment ensures that any honest pollster has no disincentive to participate in such a scheme.
In this work we provide efficient distributed protocols for generating shares of random noise, secure against malicious participants. The purpose of the noise generation is to create a distributed implementation of the privacy-preserving statistical databases described in recent papers [14,4,13]. In these databases, privacy is obtained by perturbing the true answer to a database query by the addition of a small amount of Gaussian or exponentially distributed random noise. The computational power of even a simple form of these databases, when the query is just of the form ∑i f(di), that is, the sum over all rows i in the database of a function f applied to the data in row i, has been demonstrated in [4]. A distributed implementation eliminates the need for a trusted database administrator.
The results for noise generation are of independent interest. The generation of Gaussian noise introduces a technique for distributing shares of many unbiased coins with fewer executions of verifiable secret sharing than would be needed using previous approaches (reduced by a factor of n). The generation of exponentially distributed noise uses two shallow circuits: one for generating many arbitrarily but identically biased coins at an amortized cost of two unbiased random bits apiece, independent of the bias, and the other to combine bits of appropriate biases to obtain an exponential distribution.
We initiate a theoretical study of the census problem. Informally, in a census individual respondents give private information to a trusted party (the census bureau), who publishes a sanitized version of the data. There are two fundamentally conﬂicting requirements: privacy for the respondents and utility of the sanitized data. Unlike in the study of secure function evaluation, in which privacy is preserved to the extent possible given a speciﬁc functionality goal, in the census problem privacy is paramount; intuitively, things that cannot be learned “safely” should not be learned at all.
An important contribution of this work is a deﬁnition of privacy (and privacy compromise) for statistical databases, together with a method for describing and comparing the privacy oﬀered by speciﬁc sanitization techniques. We obtain several privacy results using two diﬀerent sanitization techniques, and then show how to combine them via cross training. We also obtain two utility results involving clustering.
In a recent paper Dinur and Nissim considered a statistical database in which a trusted database administrator monitors queries and introduces noise to the responses with the goal of maintaining data privacy. Under a rigorous deﬁnition of breach of privacy, Dinur and Nissim proved that unless the total number of queries is sub-linear in the size of the database, a substantial amount of noise is required to avoid a breach, rendering the database almost useless.
As databases grow increasingly large, the possibility of being able to query only a sub-linear number of times becomes realistic. We further investigate this situation, generalizing the previous work in two important directions: multi-attribute databases (previous work dealt only with single-attribute databases) and vertically partitioned databases, in which diﬀerent subsets of attributes are stored in diﬀerent databases. In addition, we show how to use our techniques for datamining on published noisy statistics.