In 1895, Lord Kelvin, the renowned physicist, declared that “heavier-than-air flying machines are impossible” and dismissed those pursuing such research. Had the scientific community heeded his words and those of other skeptics, the advances in aviation that define our modern world would have been needlessly delayed. Yet unfortunately, some in the scientific community still have not learned that betting against human ingenuity is a fool’s game. The most recent example comes from Arvind Narayanan and Ed Felten, who in a recent paper declared that de-identification has never worked and never will. (Their paper was intended as a rebuttal to a piece written by Dr. Ann Cavoukian, the former Ontario Privacy Commissioner, and me, which demonstrated that claims made in the popular press about academic research on re-identification methods often overstate the findings or omit important details.)
The authors are making an incredible claim. They are not saying that de-identification sometimes fails (which is painfully obvious to even the casual observer), but rather that there is no such thing as anonymous data. Narayanan and Felten write, “there is no evidence that de-identification works either in theory or in practice.” Such a claim is not only blatantly false, but it is also dangerously misleading because it suggests data custodians should never release de-identified data sets.
The claim that anonymous data does not exist is trivial to refute. A de-identified dataset can be as simple as a spreadsheet containing a few columns of information with no associated personal identifiers. For example, survey data is often de-identified before it is shared with other researchers, and there is no evidence that these datasets have been routinely re-identified. Researchers have developed various techniques to de-identify data, from the low-tech black marker on photos to advanced statistical methods for more complex datasets. Or to take a more tangible example, consider the secret ballot, which is a core part of democracy in the United States. Every year millions of Americans cast ballots on Election Day. Election officials collect the ballots, strip them of any personally identifiable information, and tally the votes. While most of these data sets about how citizens vote are not released publicly, they have been de-identified for internal use.
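To make the spreadsheet example concrete, here is a minimal sketch of what de-identifying a simple survey file can look like: dropping the columns that directly identify respondents and keeping only the responses. The column names and records below are hypothetical, invented purely for illustration.

```python
import csv
import io

# Hypothetical survey records: the first two columns are direct
# identifiers; the remaining columns are the responses researchers need.
RAW = """name,email,age_range,uses_internet_daily
Alice Smith,alice@example.com,25-34,yes
Bob Jones,bob@example.com,45-54,no
"""

IDENTIFIER_COLUMNS = {"name", "email"}

def drop_identifiers(raw_csv, identifier_columns):
    """Return the CSV text with the direct-identifier columns removed."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    kept = [f for f in reader.fieldnames if f not in identifier_columns]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept)
    writer.writeheader()
    for row in reader:
        writer.writerow({k: row[k] for k in kept})
    return out.getvalue()

print(drop_identifiers(RAW, IDENTIFIER_COLUMNS))
```

This is of course the simplest possible case; for complex or high-dimensional data, removing direct identifiers alone is not enough, which is exactly where the more advanced statistical methods mentioned above come in.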
Narayanan and Felten double down on this absurdity by stating that “most ‘anonymized’ datasets require no more skill than programming and basic statistics to de-identify.” There are at least three problems with their argument.

First, scores of anonymized survey datasets are regularly published (e.g., the Pew Research Internet Project publishes its survey data), yet there are no examples of high school or college students re-identifying these datasets (nor anyone else, for that matter).

Second, Narayanan and Felten make the mistake of arguing that de-identification does not work by pointing to poorly de-identified datasets. The most recent example is New York City’s release of its taxi cab data without properly de-identifying it. Instead of removing the drivers’ medallion numbers, the city ran each one through a one-way hash function; because the set of valid medallion numbers is small, anyone could hash every possible medallion and reverse the mapping, so this was not a valid way to anonymize the data. Yet Narayanan and Felten chalk this up as an example of the brilliance of re-identification techniques, when it is really an example of terrible de-identification techniques. By their logic, every high school kid in America is an expert lock pick, so long as you are talking about breaking into homes whose doors were already unlocked.

Third, regardless of how many people have the skills to re-identify some poorly de-identified data sets, the real question is how much risk this poses in practice. After all, there are many other ways that sensitive information about someone may be discovered that do not involve a course in statistics, from loose lips to late-night dumpster diving. The actual risk depends on a variety of factors, such as the incentives and disincentives for those with the skills to re-identify datasets to actually do so.
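The taxi example is worth spelling out, because it shows why hashing a small identifier space is terrible de-identification rather than clever re-identification. The sketch below uses an invented license-number format (one digit, one letter, two digits) rather than the real NYC medallion scheme, but the principle is the same: when only a few thousand valid identifiers exist, an attacker can hash all of them and invert the “one-way” function by lookup.

```python
import hashlib

def md5_hex(s):
    """MD5 digest of a string, as hex."""
    return hashlib.md5(s.encode()).hexdigest()

def build_lookup_table():
    """Hash every possible license number in our toy format
    (<digit><letter><two digits>, e.g. "5A32") and map hash -> plaintext.
    Only 10 * 26 * 100 = 26,000 candidates exist, so this is instant."""
    table = {}
    for d in "0123456789":
        for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
            for n in range(100):
                plate = f"{d}{letter}{n:02d}"
                table[md5_hex(plate)] = plate
    return table

table = build_lookup_table()

# What a badly "anonymized" release would contain for license "5A32":
published_hash = md5_hex("5A32")

# The attacker recovers the original identifier with a dictionary lookup.
print(table[published_hash])  # prints "5A32"
```

Proper de-identification of such a field would remove it entirely or replace it with a random token that has no mathematical relationship to the original, leaving nothing to brute-force.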
While certainly some people are drawn to bad behavior, I am no more worried about the average computer scientist using his or her skills for evil than I am about the average chemist building a bomb.
If Narayanan and Felten truly believe de-identified datasets do not exist, they should start by convincing their fellow faculty. After all, their own university’s institutional review board exempts researchers who use de-identified data from meeting certain requirements for research on humans. It’s unfortunate that the authors make such obviously false claims because it tarnishes the credibility of some otherwise good points. For example, they note the challenges of de-identifying location data and other high-dimensional data. There are indeed many important areas where new types of data are being created, and there is a need for new de-identification methods to deal with these data sets. That is one reason why groups like ITIF have been strong advocates for additional federal funding for privacy R&D.
Ultimately, de-identification is one tool among many that will be used to improve privacy for millions of individuals around the world, but it is undeniably an important one. It would be more productive to focus our discussions on how to best de-identify data, such as by developing industry best practices, rather than simply declaring the system unfixable and walking away.
Photo credit: Flickr user chelscore