What is K-Anonymity?
K-anonymity is a privacy property of a dataset, and by extension the name for anonymization techniques that enforce it. A dataset is k-anonymous if every record is indistinguishable from at least k-1 other records with respect to a set of attributes known as quasi-identifiers: attributes that do not identify a person on their own but can do so in combination. In simpler terms, every individual in a k-anonymous dataset hides in a group of at least k people who look identical on those attributes.
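The definition can be checked mechanically: group the records by their quasi-identifier values and look at the smallest group. The sketch below is a minimal illustration, not a library API; the function name `anonymity_level` and the toy records are assumptions made for this example.

```python
from collections import Counter


def anonymity_level(records, quasi_identifiers):
    """Return the largest k for which `records` is k-anonymous.

    That is the size of the smallest group of records sharing the same
    value on every quasi-identifier. An empty dataset returns 0.
    """
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return min(groups.values(), default=0)


# Hypothetical toy data: each (age_band, zip_prefix) pair is shared by two people.
toy = [
    {"age_band": "20-29", "zip_prefix": "123", "disease": "Flu"},
    {"age_band": "20-29", "zip_prefix": "123", "disease": "Asthma"},
    {"age_band": "30-39", "zip_prefix": "123", "disease": "Flu"},
    {"age_band": "30-39", "zip_prefix": "123", "disease": "Diabetes"},
]

print(anonymity_level(toy, ["age_band", "zip_prefix"]))  # 2
```

The result depends on which attributes are treated as quasi-identifiers: adding `disease` to the list in this toy example drops the level to 1, because every row then becomes unique.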
Why is K-Anonymity Important?
Consider a hospital's dataset containing patient records with attributes such as age, gender, ZIP code, and medical conditions. Even after names and Social Security numbers are removed, combinations of quasi-identifiers (like age, gender, and ZIP code) can often be matched against outside sources to re-identify individuals. For example, if only one 28-year-old male in the dataset lives in ZIP code 12345, anyone who knows those three facts about him can find his record and read his medical condition. K-anonymity requires every such combination to be shared by at least k records, which is achieved by generalizing or suppressing quasi-identifier values, so that an attacker can at best narrow a person down to a group of k.
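To make the re-identification risk concrete, here is a rough sketch of a linkage attack: joining a de-identified table with an outside list on the quasi-identifiers. Everything in it is hypothetical, including the `link` helper, the hospital rows, and the public list with its placeholder names.

```python
# Hypothetical de-identified hospital rows: (age, gender, zip_code, condition).
hospital = [
    (28, "Male", "12345", "Flu"),
    (34, "Female", "12346", "Diabetes"),
    (33, "Male", "12345", "Asthma"),
]

# Hypothetical public list with names: (name, age, gender, zip_code).
public = [
    ("Person A", 28, "Male", "12345"),
    ("Person B", 52, "Female", "12399"),
]


def link(public_rows, hospital_rows):
    """Return (name, condition) pairs where exactly one hospital row matches."""
    matches = []
    for name, age, gender, zip_code in public_rows:
        hits = [
            condition
            for a, g, z, condition in hospital_rows
            if (a, g, z) == (age, gender, zip_code)
        ]
        if len(hits) == 1:  # a unique match re-identifies the person
            matches.append((name, hits[0]))
    return matches


print(link(public, hospital))  # [('Person A', 'Flu')]
```

If at least k hospital rows shared every quasi-identifier combination, `hits` would never have length 1 for k of 2 or more, and this simple join would no longer single anyone out.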
How Does K-Anonymity Work?
Let's walk through a practical example.
Initial Dataset
Suppose we have a dataset of ten individuals with their Age, Gender, ZIP code, and Disease. Age, Gender, and ZIP code are the quasi-identifiers, and Disease is the sensitive attribute we want to protect:
ID | Age | Gender | ZIP code | Disease |
---|---|---|---|---|
1 | 28 | Male | 12345 | Flu |
2 | 34 | Female | 12346 | Diabetes |
3 | 45 | Female | 12347 | Hypertension |
4 | 36 | Male | 12348 | Flu |
5 | 50 | Female | 12349 | Arthritis |
6 | 28 | Female | 12350 | Flu |
7 | 33 | Male | 12345 | Asthma |
8 | 45 | Male | 12346 | Cancer |
9 | 50 | Male | 12347 | Hypertension |
10 | 36 | Female | 12348 | Arthritis |
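Before generalizing anything, it helps to measure how exposed the raw table is. The sketch below transcribes the ten records above and lists the IDs whose (Age, Gender, ZIP code) combination occurs exactly once; the helper name `unique_record_ids` is an assumption made for this example.

```python
from collections import Counter

# The ten records from the table above: (ID, Age, Gender, ZIP code, Disease).
RECORDS = [
    (1, 28, "Male", "12345", "Flu"),
    (2, 34, "Female", "12346", "Diabetes"),
    (3, 45, "Female", "12347", "Hypertension"),
    (4, 36, "Male", "12348", "Flu"),
    (5, 50, "Female", "12349", "Arthritis"),
    (6, 28, "Female", "12350", "Flu"),
    (7, 33, "Male", "12345", "Asthma"),
    (8, 45, "Male", "12346", "Cancer"),
    (9, 50, "Male", "12347", "Hypertension"),
    (10, 36, "Female", "12348", "Arthritis"),
]


def unique_record_ids(records):
    """Return IDs of records whose (Age, Gender, ZIP) combination occurs once."""
    counts = Counter((age, gender, zip_code) for _, age, gender, zip_code, _ in records)
    return [
        record_id
        for record_id, age, gender, zip_code, _ in records
        if counts[(age, gender, zip_code)] == 1
    ]


print(unique_record_ids(RECORDS))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Every record is unique on its quasi-identifiers, so the raw table is only 1-anonymous: knowing a person's age, gender, and ZIP code is enough to find their row.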
Generalized Dataset for 3-Anonymity
To achieve 3-anonymity, we need to generalize or suppress the quasi-identifiers (Age, Gender, ZIP code) so that every combination of their values that occurs in the dataset is shared by at least three records.
Age | Gender | ZIP code | Disease |
---|---|---|---|
20-39 | Any | 123** | Flu |
20-39 | Any | 123** | Diabetes |
40-59 | Any | 123** | Hypertension |
20-39 | Any | 123** | Flu |
40-59 | Any | 123** | Arthritis |
20-39 | Any | 123** | Flu |
20-39 | Any | 123** | Asthma |
40-59 | Any | 123** | Cancer |
40-59 | Any | 123** | Hypertension |
20-39 | Any | 123** | Arthritis |
In this generalized dataset, every record falls into one of two groups that share the same quasi-identifier values: six records with Age 20-39 and four records with Age 40-59. The smallest group has four records, so the dataset is 4-anonymous and therefore also 3-anonymous.
Explanation:
- Age: Exact ages are replaced by non-overlapping 20-year ranges (20-39 and 40-59), so an individual's exact age can no longer be pinpointed.
- Gender: Gender is suppressed (shown as Any). With only five males and five females, keeping it would leave room for just one group of three or more per gender, which would force all ages into a single range.
- ZIP code: The last two digits of the ZIP code are masked with asterisks to represent a broader area. Masking only the last digit is not enough here, because 12350 would become 1235* and that record would stand alone.
Resulting Dataset for 3-Anonymity:
In this example, no record can be singled out using Age, Gender, and ZIP code, because each combination of these values is shared by at least four records. Someone who knows a person's quasi-identifiers can narrow the search down to a group but not to a single row, which makes re-identification based on these attributes alone much harder and provides a level of privacy protection.
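The generalization step and the k-anonymity check can both be sketched in a few lines. The code below assumes one particular scheme that reaches at least 3-anonymity on the ten records from the initial table: 20-year age bands, Gender suppressed to `Any`, and ZIP codes truncated to their first three digits. The `generalize` helper and these parameter choices are illustrative, not the only valid ones.

```python
from collections import Counter

# The ten records from the initial table: (ID, Age, Gender, ZIP code, Disease).
RECORDS = [
    (1, 28, "Male", "12345", "Flu"),
    (2, 34, "Female", "12346", "Diabetes"),
    (3, 45, "Female", "12347", "Hypertension"),
    (4, 36, "Male", "12348", "Flu"),
    (5, 50, "Female", "12349", "Arthritis"),
    (6, 28, "Female", "12350", "Flu"),
    (7, 33, "Male", "12345", "Asthma"),
    (8, 45, "Male", "12346", "Cancer"),
    (9, 50, "Male", "12347", "Hypertension"),
    (10, 36, "Female", "12348", "Arthritis"),
]


def generalize(record):
    """Drop the ID and return (age_band, gender, zip, disease) in generalized form."""
    _, age, _gender, zip_code, disease = record
    low = (age // 20) * 20              # 20-year bands: 20-39, 40-59, ...
    age_band = f"{low}-{low + 19}"
    masked_zip = zip_code[:3] + "**"    # keep only the first three digits
    return (age_band, "Any", masked_zip, disease)


generalized = [generalize(record) for record in RECORDS]

# Group by the three quasi-identifiers only; Disease is not part of the key.
class_sizes = Counter(row[:3] for row in generalized)
k = min(class_sizes.values())

for key, size in sorted(class_sizes.items()):
    print(key, size)
print("k =", k)  # k = 4
```

The two groups contain six and four records, so the smallest group size is 4. Narrower age bands or an unmasked Gender column would preserve more detail but, on a table this small, would push some groups below three records.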