Hi! I’m Anthony Tockar. I am a masters student at Northwestern University and have been working with the Science team for the summer. This is the first of two posts I will contribute on the topic of differential privacy.
Anyone who has ever googled themselves, updated their Facebook settings or even received a call from a telemarketer knows about the importance of privacy. In a world where the data being collected on us is growing exponentially, attacks on our privacy are becoming commonplace. A quick search will reveal all kinds of horror stories, such as the identification of individuals’ AOL search queries, people’s movie ratings on Netflix, and the medical records of a former Massachusetts governor. In my next post, I will demonstrate how easy it is to track individuals in New York using data from the 2013 NYC Taxi data release. Clearly, common practices need to change when it comes to protecting the individual. Put another way, is it fair to citizens and consumers to use their data without even being able to guarantee their privacy?
Data custodians are acutely aware of this issue. Here at Neustar, for example, our datasets span many domains, and can often contain sensitive information, such as PII or “harmless” DNS lookups. Our customers, of course, trust us to keep this information private. Therefore it is not surprising that we employ a series of privacy experts to maintain the company’s contracts and policies with respect to data. As you might expect, it is also of interest to us in the data science team to know what rigor has been applied to this problem.
So – how is data privatized? It turns out that there are various methods that are commonly found in practice, from simple fixes like removing columns containing PII, to more advanced treatments such as k-anonymity and l-diversity. ALL of these methods have been shown to be vulnerable to attack in one way or another. However, in the last few years a new method has emerged that has survived close scrutiny: differential privacy. Unlike other methods, differential privacy operates off a solid mathematical foundation, making it possible to provide strong theoretical guarantees on the privacy and utility of released data.
The story of differential privacy, however, is not all good news. Despite showing massive potential, it has been used sparingly in practice, and remains largely confined to theoretical settings. In this post I hope to introduce the topic to a wider audience, and have included some interactive visualizations to help readers build their intuition. I will also discuss the relevance of differential privacy in more detail, and in doing so explore whether it really is the solution to the privacy puzzle.
Definition of Differential Privacy
This simple example should help illuminate the concept:
Suppose you have access to a database that allows you to compute the total income of all residents in a certain area. If you knew that Mr. White was going to move to another area, simply querying this database before and after his move would allow you to deduce his income.
What could one do to stop this? Perhaps make sure that every query returns an approximation of the total income? This could accomplish the goal of allowing the dataset to yield accurate information while protecting Mr. White. Formally, differential privacy is defined as follows:
A randomized function K gives ε-differential privacy if for all data sets D and D′ differing on at most one row, and all S ⊆ Range(K),
This can be translated as meaning that the risk to one’s privacy should not substantially (as bounded by ε) increase as a result of participating in a statistical database. Thus an attacker should not be able to learn any information about any participant that they could not learn if the participant had opted out of the database. One could then state with some confidence that there is a low risk of any individual’s privacy being compromised as a result of their participation in the database.
Sounds great! So, how is it applied? Let’s go back to our example. If we can add some noise to the result of a query on our dataset, we should be able to ensure that the formula above holds. The function K() is our mechanism for adding this noise. Differential privacy-preserving mechanism design is a topic on its own. Here it suffices to say that there are different mechanisms available depending on the use case: for example, the Laplace mechanism, which I will introduce in the next paragraph, is unsuitable for categorical data – the exponential mechanism is more appropriate, though a strong case can be made for other methods depending on the outcome.
The Laplace mechanism involves adding random noise that conforms to the Laplace statistical distribution. That’s it! The obvious follow-up question is: well, how much noise should be added – i.e. how should we define our Laplace random variable? The 0-centered Laplace distribution has only one parameter (its scale), and this is directly proportional to its standard deviation, or noisiness. So how should we set the scale? Naturally it should have some dependence on the privacy parameter, ε. It should also depend on the nature of the query itself, and more specifically, the risk to the most different individual of having their private information teased out of the data. This can be defined mathematically, and is known as the sensitivity of the query:
Simply stated, it is the maximum difference in the values that the query f may take on a pair of databases that differ in only one row. There is a neat proof that shows that by adding a random Laplace() variable to a query, ε-differential privacy is guaranteed.
The more astute among us may be thinking: “Wait! If you just add a symmetrical, 0-centred distribution to the data, I can just run the query multiple times and take the average!” You’d be right if not for the property known as composition. Now while the definition differs depending on the technique used to query the data, in the simple, non-adaptive case, the composition property is as follows:
For a dataset queried q times, with each query having privacy parameter εi, the total privacy budget of the dataset is given by .
So when we think about ε, we should really be thinking about a privacy budget rather than purely the statistical upper bound of a query. Sure, each query yields , but it is the total ε we should be concerned about as it reflects the maximum privacy release allowable for the total query session. As each query answer leaks privacy, once the budget is exceeded the user will not be able to make any further queries. It is this feature that allows differential privacy to work in practice.
How might this look?
Since we now have a basic grasp of the concept, let’s take it for a spin! In our Mr. White example above, let’s assume the total income in his original neighborhood is $50 million. After he leaves, this figure drops to $49 million. Therefore, one can infer that his true income is $1 million. To keep his income private, we have to ensure the query response is noisy enough to ‘hide’ this information. In fact, to ensure we privatize income for all people in our dataset we need to make sure the richest person is protected as well. As it turns out, Mr. White was the richest person in his neighborhood, so the sensitivity is $1 million *. Play around with the tool below to see what this looks like in practice – I have left ε variable to emphasize its effect.
It is plain to see that Mr. White is protected – even at lenient levels of privacy, it is all but impossible to deduce his true income by running the query once. The histogram shows what would happen if we were given free rein to run multiple queries – eventually, we would reach a point where we could easily find his true income. However, we would be violating our requirement as mentioned above.
Now that we’ve seen what would happen with a query on a single value, let’s look at what happens when we privatize a whole distribution of values. The tool below provides the density plot for several different distributions, both real and randomly-generated, and again the user is free to experiment with different privacy parameters. Hopefully through interacting with this widget, the reader can develop a more intuitive understanding of the effects of differential privacy.
There is a key difference in how this query is privatized. As it is a density plot, we are only interested in how many observations occur at each value (similar to a histogram). Since this is a series of counts, our sensitivity is 1 (the maximum amount an individual can contribute). It follows that we can add Laplace random variables to the count at each value to guarantee privacy. Click on the graph for a more in-depth description.
* Note that this definition of sensitivity is dependent on our definition of the dataset – if we were considering every neighborhood in the US for example our sensitivity would be a lot higher – but we could adjust for this by setting a more relaxed privacy parameter given the difficulty of prizing an individual’s private data from a dataset of this breadth.
Relevance of Differential Privacy
Why should we care? At the very least differential privacy provides an interesting and rigorous framework around publishing data. So far, it is the only approach that has both theoretical bounds on the privacy of users in the dataset and still enables scientists to mine useful insights from it. Other solutions, such as anonymization (hashing), have time and again proven insufficient. Just look at the recent NYC taxi data that was de-anonymized within days of its release! Even with perfect anonymization, it has been posited that 63% of people in the US can be uniquely identified by just their birth date, zip code and gender.*
Unfortunately these attacks tend to stifle the release of data and information sharing. Differential privacy provides some hope. As we have learned, it is inherently flexible, which means it can easily be adapted to environments with differing privacy requirements. This flexibility does come at a cost: as we have seen, having very tiny privacy budgets (ε) can make some queries all but useless. However, as more people understand the concepts and more products get built on top of this paradigm we expect to see more sharing of data into the public domain without privacy concerns.
The more philosophical question is how private is private enough? Clearly, there is some tunability between how useful a differentially private query is and how ‘private’ it is. The aforementioned tradeoff between utility and privacy is unfortunately ‘left to the reader’. The literature does provide some rules of thumb for setting ε, with suggestions like 0.01, or ln2, etc. – however these have scant theoretical support. Perhaps most importantly, there are few, if any, precedents. At the end of the day it is the data curator’s job (or his lawyer) to decide on ‘private enough’. The lack of a clear framework to relate ε to privacy levels coupled with the difficulty of explaining it to the layperson has meant that differential privacy has largely remained confined to academia. However, as more people learn about it and more tools begin to emerge (PINQ, Airavat) this is starting to change. Clearly in this world of massive data sets and smart data scientists and hackers, data privacy needs to keep pace. We are very hopeful that these techniques are the next step.
* Aside: This is particularly interesting considering the birthday paradox, which states that in a room of 23 people, there is a >50% chance at least two of them share a birthday. However, by dissecting further by zip code and gender (and of course birth year), it appears that this no longer holds to the same degree.
Of most interest to us at Neustar is the practical implementation of differential privacy. For example, how can it be applied to the world of Internet advertising? What about sketching? Well, there has actually been some proprietary work in these areas. Our friend Muthu has used sketches to develop pan-private algorithms for dynamic data, which guarantee privacy and security for certain queries. In addition, this paper suggests differentially private measures for preventing attacks that utilize personalized advertising campaigns. Works such as these help pave the way for a new dawn in the way we think about data.
It is worth pointing out again that this post is not supposed to be a complete discussion of differential privacy and its nuances. Differential privacy comes in many different forms and iterations which have not been covered, and it does have several limitations. There have been reams written about it, and the interested reader is encouraged to seek out further information. The concept of differential privacy is still young, and there is still a lot to be fleshed out. What is clear is that it holds much potential.
In my next post, I will make it abundantly clear why we should be taking privacy seriously, and follow this up with a demonstration of how to privatize a real dataset in a way that protects its participants while retaining its ability to bestow useful insights.