In cryptography, differential privacy aims to provide means to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying their individual records.
Contents
- Motivation
- Netflix Prize
- Massachusetts Group Insurance Commission (GIC) medical encounter database
- Metadata and Mobility databases
- Synopsis
- Principle and illustration
- Formal definition and example application
- Sensitivity
- Trade-off between utility and privacy
- Other notions of differential privacy
- Differentially private mechanisms
- The Laplace mechanism
- Sequential composition
- Parallel composition
- Group privacy
- Stable transformations
- Adoption of differential privacy in real-world applications
- References
Motivation
Consider a trusted party that holds a dataset of sensitive private information (for example, medical records, movie viewing histories, or email usage) and would like to provide global, statistical information about the data. Such a system is called a statistical database. However, providing aggregate statistical information about the data may reveal some information about the individuals. In fact, various ad hoc approaches to anonymizing public records have failed when researchers managed to identify personal information by linking two or more separately innocuous databases. Differential privacy is a framework for formalizing privacy in statistical databases, introduced in order to protect against these kinds of deanonymization techniques.
Netflix Prize
For example, in 2007, Netflix offered a $1 million prize for a 10% improvement in its recommendation system. Netflix also released a training dataset for the competing developers to train their systems. While releasing this dataset, they provided a disclaimer: "To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids [sic] have been replaced by randomly assigned ids [sic]."
Netflix is not the only movie-rating portal on the web; there are many others, including IMDb. On IMDb individuals can register and rate movies and they have the option of not keeping their details anonymous. Arvind Narayanan and Vitaly Shmatikov, researchers at the University of Texas at Austin, linked the Netflix anonymized training database with the IMDb database (using the date of rating by a user) to partially de-anonymize the Netflix training database, compromising the identity of some users.
Massachusetts Group Insurance Commission (GIC) medical encounter database
Latanya Sweeney from Carnegie Mellon University linked the anonymized GIC database (which retained the birthdate, sex, and ZIP code of each patient) with voter registration records, and was able to identify the medical record of the governor of Massachusetts.
Metadata and Mobility databases
De Montjoye et al. from MIT introduced the notion of unicity (meaning uniqueness) and showed that four spatio-temporal points (approximate places and times) are enough to uniquely identify 95% of 1.5 million people in a mobility database. The study further shows that these constraints hold even when the resolution of the dataset is low, meaning that even coarse or blurred mobility datasets and metadata provide little anonymity.
Synopsis
In 2006, Cynthia Dwork defined the field of differential privacy, building on work that had started appearing in 2003. Her work showed that some semantic security goals, related to Tore Dalenius's work from the 1970s, were impossible to achieve, and it identified new techniques for limiting the increased privacy risk that results from including private data in a statistical database. These techniques make it possible in many cases to provide very accurate statistics from the database while still ensuring high levels of privacy.
Principle and illustration
Differential privacy works by introducing randomness into the data.
A simple example, developed especially in the social sciences, is to ask a person to answer the question "Do you have attribute A?" according to the following procedure:
- Flip a coin.
- If heads, then answer honestly.
- If tails, then flip the coin again and answer "Yes" if heads, "No" if tails.
Confidentiality arises from the deniability of each individual response. In aggregate, however, the data from many responses remain meaningful: a positive response is given with probability one quarter by people who do not have attribute A and with probability three quarters by people who actually possess it. Thus, if p is the true proportion of people with A, then we expect to obtain (1/4)(1 − p) + (3/4)p = 1/4 + p/2 positive responses, from which it is possible to estimate p. In particular, if attribute A is synonymous with illegal behavior, then answering "Yes" is not incriminating, since every person has some probability of a "Yes" response, whatever the truth may be.
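This survey procedure and the estimator it implies can be simulated directly. A minimal sketch (the function names and the population size are illustrative; the inversion follows the expected-response formula above):

```python
import random

def randomized_response(has_attribute: bool) -> bool:
    """One respondent's answer under the two-coin procedure above."""
    if random.random() < 0.5:        # first flip: heads -> answer honestly
        return has_attribute
    return random.random() < 0.5     # tails: second flip decides the answer

def estimate_p(answers) -> float:
    """Invert E[yes fraction] = 1/4 + p/2 to recover an estimate of p."""
    yes_fraction = sum(answers) / len(answers)
    return 2 * (yes_fraction - 0.25)

random.seed(0)
true_p = 0.3
population = [random.random() < true_p for _ in range(100_000)]
answers = [randomized_response(x) for x in population]
print(estimate_p(answers))  # close to the true proportion 0.3
```

Each individual answer is deniable, yet the aggregate estimate converges to the true proportion as the number of respondents grows.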
Formal definition and example application
Let ε be a positive real number and A be a randomized algorithm that takes a dataset as input (representing the actions of the trusted party holding the data). The algorithm A is said to provide ε-differential privacy if, for all datasets D1 and D2 that differ on a single element (i.e., the data of one person), and all subsets S of the image of A,

Pr[A(D1) ∈ S] ≤ exp(ε) × Pr[A(D2) ∈ S],

where the probability is taken over the randomness used by the algorithm.
According to this definition, differential privacy is a condition on the release mechanism (i.e., the trusted party releasing information about the dataset) and not on the dataset itself. Intuitively, this means that for any two datasets that are similar, a given differentially private algorithm will behave approximately the same on both datasets. The definition gives a strong guarantee that presence or absence of an individual will not affect the final output of the algorithm significantly.
For example, assume we have a database of medical records D1, where each record is a pair (Name, X) and X is a Boolean denoting whether the person has diabetes:

| Name | Has diabetes (X) |
| --- | --- |
| Ross | 1 |
| Monica | 1 |
| Joey | 0 |
| Phoebe | 0 |
| Chandler | 1 |

Now suppose a malicious user (often termed an adversary) wants to find out whether Chandler has diabetes or not. Suppose he also knows in which row of the database Chandler resides. Now suppose the adversary is only allowed to use a particular form of query Q_i that returns the partial sum of the first i rows of column X. In order to find Chandler's diabetes status, the adversary executes Q5(D1) and Q4(D1) and computes their difference: Q5(D1) − Q4(D1) = 3 − 2 = 1, so the X field in Chandler's row must be 1. This example highlights how individual information can be compromised even without explicitly querying for the data of a specific individual.

Continuing this example, if we construct D2 by replacing (Chandler, 1) with (Chandler, 0), then the adversary can distinguish D2 from D1 by computing Q5 − Q4 for each dataset. If instead the adversary were required to receive the values Q_i via an ε-differentially private algorithm, then for sufficiently small ε he would be unable to distinguish between the two datasets.
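One way to make the differencing attack concrete (the row names and the partial-sum query form here are illustrative):

```python
def partial_sum(db, i):
    """Q_i: sum of the Boolean column over the first i rows."""
    return sum(x for _, x in db[:i])

# Two neighboring databases differing only in the last person's record.
D1 = [("Ross", 1), ("Monica", 1), ("Joey", 0), ("Phoebe", 0), ("Chandler", 1)]
D2 = [("Ross", 1), ("Monica", 1), ("Joey", 0), ("Phoebe", 0), ("Chandler", 0)]

# With exact answers, differencing two aggregate queries reveals one record:
print(partial_sum(D1, 5) - partial_sum(D1, 4))  # 1 -> the record in D1 is 1
print(partial_sum(D2, 5) - partial_sum(D2, 4))  # 0 -> D2 is distinguishable
```

A differentially private release mechanism would answer each Q_i with enough randomness that these two differences become statistically hard to tell apart.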
Sensitivity
Let d be a positive integer, D be a collection of datasets, and f : D → R^d be a function. The sensitivity of the function, denoted Δf, is defined by

Δf = max ||f(D1) − f(D2)||_1,

where the maximum is over all pairs of datasets D1 and D2 in D differing in at most one element, and ||·||_1 denotes the L1 norm.

In the example of the medical database above, if we consider f to be the query Q_i, then the sensitivity of the function is one, since changing any single entry in the database causes the output of the query to change by either zero or one.
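For small domains the definition can be checked by brute force. A sketch over length-n binary databases (the helper name and the count query are illustrative):

```python
from itertools import product

def sensitivity(f, n):
    """Brute-force L1 sensitivity of f over length-n binary databases
    that differ in exactly one entry."""
    best = 0
    for db in product([0, 1], repeat=n):
        for i in range(n):
            neighbor = list(db)
            neighbor[i] = 1 - neighbor[i]      # flip one person's record
            best = max(best, abs(f(list(db)) - f(neighbor)))
    return best

count = lambda db: sum(db)
print(sensitivity(count, 4))  # 1: flipping one entry changes a count by 1
```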
There are techniques, described below, that allow us to create differentially private algorithms for functions with low sensitivity.
Trade-off between utility and privacy
There is a trade-off between the accuracy of the statistics estimated in a privacy-preserving manner and the privacy parameter ε: smaller values of ε enforce stronger privacy but require more noise, yielding less accurate statistics.
Other notions of differential privacy
Since differential privacy is considered to be too strong for some applications, many weakened versions of privacy have been proposed. These include (ε, δ)-differential privacy, randomised differential privacy, and privacy under a metric.
Differentially private mechanisms
Since differential privacy is a probabilistic concept, any differentially private mechanism is necessarily randomized. Some of these, like the Laplace mechanism described below, rely on adding controlled noise to the function that we want to compute. Others, like the exponential mechanism and posterior sampling, sample from a problem-dependent family of distributions instead.
The Laplace mechanism
Many differentially private methods add controlled noise to functions with low sensitivity. The Laplace mechanism adds Laplace noise, i.e. noise from the Laplace distribution, which can be expressed by the probability density function noise(y) ∝ exp(−|y|/λ) and which has mean zero and standard deviation √2·λ. Given a function f and a dataset x, the mechanism releases A(x) = f(x) + Y, where Y is drawn from Lap(λ). For neighboring datasets x and y, the ratio of the output densities of A(x) and A(y) at any point t is exp((|t − f(y)| − |t − f(x)|)/λ), which is at most exp(|f(x) − f(y)|/λ) ≤ exp(Δf/λ) by the triangle inequality. Taking λ = Δf/ε therefore makes A an ε-differentially private algorithm.
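A minimal sketch of the Laplace mechanism using only the standard library (the function name and the example count are illustrative; a Laplace variate with scale b is sampled as the difference of two exponential variates with mean b):

```python
import random

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value perturbed with Laplace(sensitivity/epsilon) noise."""
    b = sensitivity / epsilon
    # Difference of two Exp(mean=b) variates is Laplace(0, b).
    noise = b * (random.expovariate(1.0) - random.expovariate(1.0))
    return true_value + noise

# A count query has sensitivity 1, so adding Laplace(1/eps) noise to the
# exact count yields an eps-differentially private answer.
exact_count = 3   # e.g. number of diabetics in the example database above
print(laplace_mechanism(exact_count, sensitivity=1.0, epsilon=0.5))
```

Repeated releases are unbiased: the noise has mean zero, so averages of noisy answers concentrate around the true value, while any single answer stays deniable.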
Sequential composition
If we query an ε-differentially private mechanism t times, and the randomization of the mechanism is independent for each query, then the result would be (εt)-differentially private. More generally, if there are n independent mechanisms A1, …, An whose privacy guarantees are ε1, …, εn respectively, then any function g of them, g(A1, …, An), is (Σ εi)-differentially private.
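This additivity is why deployments speak of a "privacy budget". A minimal sketch of a budget accountant (the class and its interface are hypothetical, not taken from any particular library):

```python
class PrivacyAccountant:
    """Tracks the total privacy cost of sequentially composed queries:
    under sequential composition, the epsilons of independent mechanisms add."""

    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        if self.spent + epsilon > self.budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

acct = PrivacyAccountant(budget=1.0)
for eps in [0.25, 0.25, 0.25, 0.25]:
    acct.charge(eps)
print(acct.spent)  # 1.0 -- a fifth 0.25-DP query would be refused
```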
Parallel composition
Furthermore, if the previous mechanisms are computed on disjoint subsets of the private database, then the function g would be (max_i εi)-differentially private instead.
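Histograms are the canonical use of this rule: each person appears in exactly one bucket, so releasing every noisy bucket count costs only the maximum ε rather than the sum. A sketch under illustrative bucket boundaries (`noisy_count` is an assumed helper implementing the Laplace mechanism for counts):

```python
import random

def noisy_count(records, epsilon):
    """eps-DP count via Laplace noise (the sensitivity of a count is 1)."""
    b = 1.0 / epsilon
    noise = b * (random.expovariate(1.0) - random.expovariate(1.0))
    return len(records) + noise

ages = [23, 35, 41, 67, 52, 38, 29]
buckets = {
    "18-34": [a for a in ages if 18 <= a <= 34],
    "35-49": [a for a in ages if 35 <= a <= 49],
    "50+":   [a for a in ages if a >= 50],
}
# Each person falls in exactly one disjoint bucket, so by parallel
# composition the whole histogram is 0.5-DP, not (3 * 0.5)-DP.
histogram = {k: noisy_count(v, epsilon=0.5) for k, v in buckets.items()}
print(histogram)
```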
Group privacy
In general, ε-differential privacy is designed to protect the privacy between neighboring databases which differ only in one row. This means that no adversary with arbitrary auxiliary information can know if one particular participant submitted his information. However, this is also extendable if we want to protect databases differing in c rows, which amounts to protecting against an adversary with arbitrary auxiliary information who knows whether c particular participants submitted their information. This can be achieved because if c items change, the probability dilation is bounded by exp(εc) instead of exp(ε); i.e., for D1 and D2 differing on c items,

Pr[A(D1) ∈ S] ≤ exp(εc) × Pr[A(D2) ∈ S].

Thus setting ε instead to ε/c achieves the desired result (protection of c items with ε-differential privacy). In other words, instead of each single item being protected with ε-differential privacy, every group of c items is protected with ε-differential privacy (and each single item is protected with (ε/c)-differential privacy).
Stable transformations
A transformation T is c-stable if the Hamming distance between T(A) and T(B) is at most c times the Hamming distance between A and B for any two databases A, B. If there is a mechanism M that is ε-differentially private, then the composite mechanism M ∘ T is (ε × c)-differentially private.

This could be generalized to group privacy, as the group size could be thought of as the Hamming distance h between A and B (where A contains the group and B does not). In this case M ∘ T is (ε × c × h)-differentially private.
Adoption of differential privacy in real-world applications
Several uses of differential privacy in practice are known to date: