Puneet Varma (Editor)

De-anonymization

De-anonymization (also spelled deanonymization) is a strategy in data mining in which anonymous data is cross-referenced with other sources of data to re-identify the anonymous data source. The term became popular in 2006 when Arvind Narayanan and Vitaly Shmatikov entered a contest hosted by Netflix and applied their de-anonymization techniques to successfully re-identify the Netflix records of a number of specific members.

More and more data are becoming publicly available over the Internet. Such data are typically released only after anonymization techniques have been applied, such as removing personally identifiable information (PII), including names, addresses and social security numbers, to protect the privacy of the data sources. This assurance of privacy allows the government to legally share limited data sets with third parties without requiring written permission. Such data have proved very valuable to researchers, particularly in health care. However, as the Netflix contest dramatically revealed, so much data is available that, even after anonymization, a specific individual’s identity can be rediscovered.
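The cross-referencing step described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the records, the field names and the choice of ZIP code, birth year and sex as the remaining quasi-identifiers are assumptions made for illustration, not details of the Netflix case or of any real data set. It only shows how a release with names removed might be joined against a public list that still carries names.

```python
from collections import defaultdict

# Hypothetical "anonymized" release: names removed, other fields kept.
anonymized_records = [
    {"zip": "02139", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "94105", "birth_year": 1975, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public auxiliary data that still contains names.
public_records = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "10001", "birth_year": 1990, "sex": "F"},
]

# Assumed quasi-identifiers: fields present in both data sets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Build the join key from the quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(anonymized, auxiliary):
    """Return (name, anonymized_record) pairs for keys that are
    unique in both data sets, so the match is unambiguous."""
    anon_index = defaultdict(list)
    for record in anonymized:
        anon_index[key(record)].append(record)

    aux_index = defaultdict(list)
    for record in auxiliary:
        aux_index[key(record)].append(record)

    matches = []
    for k, anon_group in anon_index.items():
        aux_group = aux_index.get(k, [])
        if len(anon_group) == 1 and len(aux_group) == 1:
            matches.append((aux_group[0]["name"], anon_group[0]))
    return matches


reidentified = link(anonymized_records, public_records)
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

In this toy data, two of the three released records share a unique combination of quasi-identifiers with a named public record, so removing the name column alone does not prevent them from being linked. The third record has no counterpart in the public list and stays unlinked.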

Examples of de-anonymization

  • "Researchers at MIT and the Université Catholique de Louvain, in Belgium, analyzed data on 1.5 million cellphone users in a small European country over a span of 15 months and found that just four points of reference, with fairly low spatial and temporal resolution, was enough to uniquely identify 95 percent of them. In other words, to extract the complete location information for a single person from an “anonymized” data set of more than a million people, all you would need to do is place him or her within a couple of hundred yards of a cellphone transmitter, sometime over the course of an hour, four times in one year. A few Twitter posts would probably provide all the information you needed, if they contained specific information about the person’s whereabouts."
  • "Sharing sequencing data sets without identifiers has become a common practice in genomics. Surnames can be recovered from personal genomes by profiling short tandem repeats on the Y chromosome and querying recreational genetic genealogy databases. It is shown that a combination of a surname with other types of metadata, such as age and state, can be used to identify the person..."
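The first example above turns on how few points are needed to single out one trace among many. A rough way to estimate this for a data set is sketched below. It is not the method used in the cited study: the function name `unicity`, the random sampling scheme and the toy traces are all assumptions made for illustration. Each trace is modeled as a set of (antenna, hour) points, and the function reports the fraction of random draws in which the sampled points match exactly one user.

```python
import random


def unicity(traces, num_points, trials=200, seed=0):
    """Estimate the fraction of random draws in which `num_points`
    points taken from one user's trace match that user only.

    `traces` maps a user id to a set of (antenna, hour) points.
    """
    rng = random.Random(seed)
    # Only users with enough points can be sampled from.
    users = [u for u, points in traces.items() if len(points) >= num_points]
    if not users:
        return 0.0

    unique = 0
    for _ in range(trials):
        user = rng.choice(users)
        sample = set(rng.sample(sorted(traces[user]), num_points))
        # Every user whose trace contains all sampled points is a candidate.
        candidates = [u for u, points in traces.items() if sample <= points]
        if candidates == [user]:
            unique += 1
    return unique / trials


# Hypothetical toy traces: (antenna id, hour of day).
traces = {
    "u1": {("A", 8), ("B", 9), ("C", 18)},
    "u2": {("A", 8), ("B", 9), ("D", 18)},
    "u3": {("A", 8), ("E", 12), ("F", 20)},
}

for p in (1, 2, 3):
    print(p, "points:", unicity(traces, p))
```

In the toy data every user passes antenna A at hour 8, so a single point is often ambiguous, while three points always isolate one trace. Real mobility data would need far more users and points before an estimate like this says anything meaningful.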