What is K-Anonymity?

July 15, 2024, by Olga Druchek

K-anonymity is a data anonymization technique used to protect sensitive data in datasets. A dataset is said to possess k-anonymity if every record is indistinguishable from at least k-1 other records with respect to certain identifying attributes, known as quasi-identifiers. In simpler terms, if a dataset has k-anonymity, each individual's record looks the same as the records of at least k-1 other individuals on those attributes.
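The definition can be turned into a small check. The following is a minimal sketch in Python, not a reference implementation: the function name `anonymity_level` and the toy records are made up for illustration. It groups records by their quasi-identifier values and reports the size of the smallest group, which is the largest k the dataset satisfies.

```python
from collections import Counter

def anonymity_level(records, quasi_identifiers):
    """Return the largest k for which `records` is k-anonymous."""
    # Count how many records share each combination of quasi-identifier values.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    # The smallest group bounds k: a record in it hides among that many rows.
    return min(groups.values()) if groups else 0

# Hypothetical toy records, already generalized.
toy = [
    {"age": "20-29", "zip": "123**", "disease": "Flu"},
    {"age": "20-29", "zip": "123**", "disease": "Asthma"},
    {"age": "30-39", "zip": "123**", "disease": "Flu"},
    {"age": "30-39", "zip": "123**", "disease": "Diabetes"},
    {"age": "30-39", "zip": "123**", "disease": "Cancer"},
]

k = anonymity_level(toy, ["age", "zip"])
print(k)  # 2
```

The toy data is 2-anonymous but not 3-anonymous, because the 20-29 group contains only two records.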

Why is K-Anonymity Important?

Consider a hospital's dataset containing patient records with attributes such as age, gender, ZIP code, and medical conditions. Even though names and Social Security numbers might be removed, combinations of quasi-identifiers (like age, gender, and ZIP code) can often be used to re-identify individuals. For example, knowing that a 28-year-old male lives in ZIP code 12345 could be enough to single out a specific person. K-anonymity ensures that such combinations are generalized or suppressed, making re-identification significantly harder.
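To make the re-identification risk concrete, here is a rough sketch of a linkage attack in Python. Everything in it is hypothetical: the two hospital rows, the public list with placeholder names, and the `link` helper exist only for this illustration. It matches de-identified rows against a named list on age, gender, and ZIP code, and reports a match only when the combination is unique on both sides.

```python
from collections import Counter

# De-identified hospital rows: names removed, quasi-identifiers intact.
hospital = [
    {"age": 28, "gender": "Male", "zip": "12345", "condition": "Flu"},
    {"age": 34, "gender": "Female", "zip": "12346", "condition": "Diabetes"},
]

# Hypothetical public list (for example, a voter roll) that carries names.
public = [
    {"name": "Person A", "age": 28, "gender": "Male", "zip": "12345"},
    {"name": "Person B", "age": 61, "gender": "Female", "zip": "12399"},
]

QI = ("age", "gender", "zip")

def link(deidentified, named, quasi_identifiers=QI):
    """Return (name, condition) pairs that can be linked unambiguously."""
    # How many de-identified rows share each quasi-identifier combination.
    row_counts = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in deidentified
    )
    # Names in the public list, indexed by the same combination.
    index = {}
    for person in named:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])
    matches = []
    for row in deidentified:
        key = tuple(row[q] for q in quasi_identifiers)
        names = index.get(key, [])
        # A link is only certain if the combination is unique on both sides.
        if row_counts[key] == 1 and len(names) == 1:
            matches.append((names[0], row["condition"]))
    return matches

reidentified = link(hospital, public)
print(reidentified)  # [('Person A', 'Flu')]
```

If a second hospital row shared the same age, gender, and ZIP code, the link would no longer be certain, which is the effect k-anonymity aims for.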

How Does K-Anonymity Work?

Let's dive into a practical example.

Initial Dataset

Let's assume we have a dataset containing sensitive information about individuals, including their Age, Gender, ZIP code, and Disease:

ID  Age  Gender  ZIP code  Disease
1   28   Male    12345     Flu
2   34   Female  12346     Diabetes
3   45   Female  12347     Hypertension
4   36   Male    12348     Flu
5   50   Female  12349     Arthritis
6   28   Female  12350     Flu
7   33   Male    12345     Asthma
8   45   Male    12346     Cancer
9   50   Male    12347     Hypertension
10  36   Female  12348     Arthritis

Generalized Dataset for 3-Anonymity

To achieve 3-anonymity, we need to generalize the quasi-identifiers (Age, Gender, ZIP code) so that each combination of these identifiers appears in at least three records.

Age    Gender  ZIP code  Disease
20-50  Male    123**     Flu
20-50  Female  123**     Diabetes
20-50  Female  123**     Hypertension
20-50  Male    123**     Flu
20-50  Female  123**     Arthritis
20-50  Female  123**     Flu
20-50  Male    123**     Asthma
20-50  Male    123**     Cancer
20-50  Male    123**     Hypertension
20-50  Female  123**     Arthritis

In this generalized dataset:

  • Ages are replaced by a range.
  • ZIP codes are partially masked with asterisks to represent a broader area.
  • Each combination of quasi-identifiers now appears in five records (five male, five female), which satisfies 3-anonymity.

Explanation:

  • Age: All ages fall into a single range, 20-50. With only five records per gender, splitting the ages into narrower ranges (e.g., 20-30, 30-40, 40-50) would leave at least one group with fewer than three records.
  • Gender: Gender is left unchanged. It has only two values here, and the groups reach the required size without suppressing it.
  • ZIP code: The last two digits of the ZIP code are masked with asterisks. Masking only the last digit would leave 12350 (1235*) alone in its own group.

Resulting Dataset for 3-Anonymity:

In this example, no record in the dataset can be uniquely identified, because each combination of Age, Gender, and ZIP code appears in at least three records. This makes it difficult to re-identify individuals based on these attributes alone, thereby providing a level of privacy protection.
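The generalization step and the 3-anonymity check can be sketched in Python. This is an illustrative sketch only: the helper names `generalize` and `smallest_group`, the age bins, and the number of masked ZIP digits are choices made for this example. It uses the ten records from the initial dataset and compares the smallest group size before generalization, with narrow age bins and one masked digit, and with a single age range and two masked digits.

```python
from collections import Counter

# The ten records from the initial dataset: (age, gender, zip, disease).
records = [
    (28, "Male", "12345", "Flu"),
    (34, "Female", "12346", "Diabetes"),
    (45, "Female", "12347", "Hypertension"),
    (36, "Male", "12348", "Flu"),
    (50, "Female", "12349", "Arthritis"),
    (28, "Female", "12350", "Flu"),
    (33, "Male", "12345", "Asthma"),
    (45, "Male", "12346", "Cancer"),
    (50, "Male", "12347", "Hypertension"),
    (36, "Female", "12348", "Arthritis"),
]

def generalize(record, age_bins, masked_digits):
    """Map one record to (age range, gender, masked ZIP); disease is dropped."""
    age, gender, zip_code, _disease = record
    # Pick the first (low, high) bin that contains the age.
    low, high = next((lo, hi) for lo, hi in age_bins if lo <= age <= hi)
    # Replace the trailing digits of the ZIP code with asterisks.
    masked_zip = zip_code[:len(zip_code) - masked_digits] + "*" * masked_digits
    return (f"{low}-{high}", gender, masked_zip)

def smallest_group(keys):
    """Size of the smallest set of records sharing the same key."""
    return min(Counter(keys).values())

# No generalization: every quasi-identifier combination is unique.
raw_k = smallest_group((age, gender, z) for age, gender, z, _ in records)

# Narrow bins, one masked digit: some groups still hold a single record.
fine_k = smallest_group(
    generalize(r, [(20, 29), (30, 39), (40, 50)], 1) for r in records
)

# One age range, two masked digits: two groups of five records each.
coarse_k = smallest_group(generalize(r, [(20, 50)], 2) for r in records)

print(raw_k, fine_k, coarse_k)  # 1 1 5
```

The comparison shows the usual trade-off: on a dataset this small, reaching k = 3 while keeping Gender intact requires very coarse values for Age and ZIP code, which reduces how useful the data is for analysis.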