The output of a differentially private algorithm run on some data will permit the same infrences about any individual’s private information, regardless of whether that individual’s private information is included in the data.
The key insight of differential privacy is that as the query is made on the data of fewer and fewer people, more noise needs to be added to the query result to produce the same amount of privacy (wiki)
Note that differential privacy only protects specifically private information (i.e. it protects your column values, but not that you have a row in the first place)
ε-differential privacy is defined as:
for all .
- is a randomised algorithm
- are datasets that differ on a single element
- is a subset of the image of (image = set of values in codomain that can be produced)
- is some positive real number
The smaller the value of , the better.
The above formulation can be rewritten in the more clear form:
We can thus consider to be a bound on the absolute difference of the log probability of . The statement reflects whether the output of our algorithm on a dataset is within some arbitrary set of possible values.
"If there are independent mechanisms: whose privacy guarantees are differential privacy, respectively, then any function of them: is -differentially private."
If and differ by elements, we have -differential privacy. Or conversely, to attain -differential privacy for each individual item, we must set to .
"For any deterministic or randomized function defined over the image of the mechanism , if satisfies -differential privacy, so does "
Given some dataset operation , we can ensure -differential privacy by adding a particular quantity of laplace noise to the output:
we define the sensitivity of a function as follows: . Therefore we can say that to satisfy differential privacy, our -bound must be:
or conversely, to get a particular level of -differential privacy we must set .
Problem: Research has shown private data can be extracted from weights of ML models
- Split data into sensitive/private and public parts
- Split sensitive/private parts into disjoint subsets
- Train a different "teacher" model on each subset
- Aggregate and add sufficient laplace noise
- Integrate output into model trained on public data