In a recent Politico op-ed Rep. Jackie Speier (D-CA) wrote, “Online trackers claim the information they collect is anonymous and helps enhance the user’s experience. But that doesn’t tell the whole story. The truth is it only takes 33 ‘bits’ of information to uniquely identify someone.” While she was certainly not the first one to make this statement (indeed Arvind Narayanan at Stanford dedicates his blog “33 Bits of Entropy” to this topic), this assertion—that individuals can be identified with 33 bits of data—is quickly becoming one of the most misused and overused “facts” in the current data privacy debates to argue that virtually all information should be considered personally identifiable information (PII).
Consider the following statements:
- Rep. Speier, who recently introduced “Do Not Track” legislation, used the above statement to support the claim that “There is no longer any anonymity on the Web—unless we mandate it.”
- FTC Chairman Jon Leibowitz noted that “We used to have a distinction 10 years ago between personally identifiable information and non-PII. Now those distinctions have broken down.”
- Computer scientists Arvind Narayanan and Vitaly Shmatikov wrote that “Just as medieval alchemists were convinced a (mythical) philosopher’s stone can transmute lead into gold, today’s privacy practitioners believe that records containing sensitive individual data can be ‘de-identified’ by removing or modifying PII.”
- EFF Senior Staff Technologist Seth Schoen claimed that “Given the proper circumstances and insight, almost any kind of information might tend to identify an individual; information about people is more identifying than has been assumed, and in the long run the whole enterprise of classifying facts as ‘PII’ or ‘not PII’ is questionable.”
It’s time to clear up some fundamental misconceptions.
The basic idea here is simple (and not in dispute). A bit is a binary digit, a 0 or 1. In information theory, entropy measures the amount of uncertainty in a random variable. Thus, for example, a fair coin toss has 1 bit of entropy (i.e. 2^1, or two equally likely possibilities).
This same framework can be used to describe anonymity. The global population of living humans is approximately 6.7 billion. Therefore, if an individual is chosen at random and no other information is known about this person, the probability of correctly identifying this person is 1 in 6.7 billion. This can also be expressed as 33 bits of entropy, since 2^33 is about 8.6 billion (strictly, log2 of 6.7 billion is about 32.6 bits, so the figure rounds up).
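The arithmetic behind the 33-bit figure is easy to check. A minimal sketch in Python, using the population figure from the text:

```python
import math

# Bits of entropy needed to single out one person chosen at random
# from the global population (6.7 billion, per the text).
world_population = 6_700_000_000

bits = math.log2(world_population)
print(f"{bits:.2f} bits")  # roughly 32.6, commonly rounded up to 33

# Sanity check: 2^33 is about 8.6 billion, comfortably above 6.7 billion.
print(2 ** 33)  # 8589934592
```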
Additional information about a person may reduce the size of the potential population, and thus reduces the relative anonymity of that person. For example, knowing a person’s gender reduces the amount of entropy by one bit (i.e. by half). Now the probability of correctly identifying this person drops to 1 in 3.35 billion (or 32 bits).
Other information may have a greater impact. For example, knowing the hometown of an individual may reduce the amount of entropy significantly. Of course, it depends on where this person lives. A person living in Paris, Texas—population 26,000—has much less anonymity than someone living in Paris, France—population 11.8 million.
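Under the same log2 measure, the two populations above translate into very different amounts of remaining anonymity. A small sketch, using the population figures from the text:

```python
import math

def anonymity_bits(population: int) -> float:
    """Bits of entropy remaining once a person is known to belong
    to a group of the given size."""
    return math.log2(population)

# Population figures from the text: knowing someone lives in Paris, Texas
# leaves far fewer bits of anonymity than knowing they live in Paris, France.
for city, pop in [("Paris, Texas", 26_000), ("Paris, France", 11_800_000)]:
    print(f"{city}: {anonymity_bits(pop):.1f} bits of anonymity")
```

Roughly 14.7 bits remain for the Texan and 23.5 bits for the Parisian: the same fact ("lives in Paris") erases very different amounts of entropy depending on context.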
The scholarship in this sub-field of information theory is useful and should not be discounted. Both the government and the private sector have publicly released poorly sanitized data sets, such as the AOL search logs, that enabled enterprising statisticians to re-identify individual records. Researchers like Latanya Sweeney have correctly warned data custodians to be more mindful of the data sets they release lest the records be re-identified. These warnings are necessary: data custodians should be more careful about how they disclose data, and they need to develop a better appreciation of the potential impact of poorly de-identified data sets.
But these conclusions have been warped in the public discourse by privacy advocates who either do not understand or willingly ignore the details.
Anonymity in real life is very different from anonymity in the lab, and most people are content to be “one in a million” even if they cannot be “one in 6.7 billion.” In any data set, highly unique individuals (i.e. the outliers) may stand out, much as today’s celebrities do not enjoy the same level of anonymity as the average citizen. However, the fact that some individuals may be identified in a particular data set does not mean that any (or all) individuals may be identified in it.
More importantly, contrary to some claims, these findings do not support the conclusion that all information is personally identifiable information. Clearly not all “bits” are created equal. The ability to re-identify someone in a poorly anonymized data set still depends on the existence of another source of personally identifiable information to link to this information. This is the difference between uniquely identifying information (i.e. “Hey, I’ve seen you before!”) and information you can use to uniquely identify an individual (i.e. “Hey, that’s Bob!”). For example, simply knowing a person’s favorite color will not help you identify that person unless 1) you have another database that links favorite colors to individual identities and 2) nobody else likes the same color.
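The point about needing a linking database can be made concrete with a toy example. All names, attributes, and the helper function here are hypothetical, invented purely for illustration:

```python
# Hypothetical auxiliary database linking attributes to identities.
aux_db = [
    {"name": "Alice", "hometown": "Paris, TX", "favorite_color": "green"},
    {"name": "Bob",   "hometown": "Paris, TX", "favorite_color": "blue"},
    {"name": "Carol", "hometown": "Paris, FR", "favorite_color": "green"},
]

def candidates(db, **known):
    """Return every record consistent with all the attributes we know."""
    return [r for r in db if all(r[k] == v for k, v in known.items())]

# Favorite color alone does not identify anyone: multiple matches remain.
print(candidates(aux_db, favorite_color="green"))

# Combined with a second attribute, the same trivia singles out one person.
print(candidates(aux_db, favorite_color="green", hometown="Paris, TX"))
```

Without an auxiliary database to join against, the “bit” of favorite-color information identifies no one; with one, it narrows the candidate pool only as far as its uniqueness allows.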
Why does this matter?
First, most definitions of personally identifiable information in current legislation are based on estimates of how easy it is to use a given piece of information to identify someone. However, these definitions are not always accurate. For example, an email address or credit card number is personally identifiable only if it corresponds directly to one person (or a small group of people). But for people who use single-use credit card numbers or email addresses, this information may not be personally identifiable. Similarly, a first initial and last name (e.g. “J. Smith”) may preserve a sufficient level of anonymity for some people, depending on the name and what other information is collected. Technical definitions like these should not be ossified in legislation because they can change over time.
Second, as I have argued before, the purpose of consumer privacy legislation should be to protect consumers from privacy harms, not to create prophylactic regulations on the use of data. For example, it is more effective to prohibit health insurers from discriminating against people with certain conditions than to pass legislation that tries to control how information flows between entities. Businesses across virtually every sector benefit from collecting, sharing, and using various types of information, and everyone benefits when this can be done more securely. As I wrote a few years ago, a wide range of medical research requires access to data sets, and we need policies in place to enable researchers to access this data. And as Jane Yakowitz correctly notes, “proposals that inhibit the dissemination of research data dispose of an important public resource without reducing the privacy risks that actually put us in peril.”
It is both possible and desirable to create anonymized data sets. We do need to develop a better understanding of whether a data set has been properly sanitized and whether releasing the data poses a risk to individuals. Creating national standards for de-identification of data may help. This is one of the important tasks that a Data Policy Office in the Department of Commerce could take on. But as Congress considers legislation this session to regulate consumer privacy, we need to make sure they have their facts straight and their bits in order.