What is K-Anonymity?

July 15, 2024 – Mykola Melnyk

K-anonymity is a data anonymization technique used to protect sensitive data in datasets. A dataset is said to possess k-anonymity if every record is indistinguishable from at least k-1 other records with respect to a set of identifying attributes, known as quasi-identifiers. In simpler terms, if a dataset has k-anonymity, each individual's record cannot be told apart from the records of at least k-1 other individuals when only those attributes are considered.
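The definition translates almost directly into a check: group the records by their quasi-identifier values and confirm that no group is smaller than k. Below is a minimal sketch; the function name is_k_anonymous and the toy records are hypothetical, chosen only for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    # Count how many records share each combination of quasi-identifier values.
    class_sizes = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    # Every group must contain at least k records.
    return all(size >= k for size in class_sizes.values())


# Hypothetical toy records (not taken from a real dataset).
records = [
    {"age": "20-39", "zip": "123**", "disease": "Flu"},
    {"age": "20-39", "zip": "123**", "disease": "Asthma"},
    {"age": "40-59", "zip": "123**", "disease": "Cancer"},
]

# The 40-59 group holds a single record, so the full list is not 2-anonymous.
print(is_k_anonymous(records, ["age", "zip"], k=2))      # False
print(is_k_anonymous(records[:2], ["age", "zip"], k=2))  # True
```

Note that the sensitive attribute (disease here) is deliberately left out of the grouping: only the quasi-identifiers decide which records count as indistinguishable.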

Why is K-Anonymity Important?

Consider a hospital's dataset containing patient records with attributes such as age, gender, ZIP code, and medical conditions. Even though names and Social Security numbers might be removed, combinations of quasi-identifiers (like age, gender, and ZIP code) can often be used to re-identify individuals. For example, knowing that a 28-year-old male lives in ZIP code 12345 could be enough to single out a specific person. K-anonymity ensures that such combinations are generalized or suppressed, making re-identification significantly harder.

How Does K-Anonymity Work?

Let's dive into a practical example.

Initial Dataset

Let's assume we have a dataset containing sensitive information about individuals, including their Age, Gender, ZIP code, and Disease:

ID	Age	Gender	ZIP code	Disease
1	28	Male	12345	Flu
2	34	Female	12346	Diabetes
3	45	Female	12347	Hypertension
4	36	Male	12348	Flu
5	50	Female	12349	Arthritis
6	28	Female	12350	Flu
7	33	Male	12345	Asthma
8	45	Male	12346	Cancer
9	50	Male	12347	Hypertension
10	36	Female	12348	Arthritis
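To see why this raw table is risky, it helps to count how many records share each (Age, Gender, ZIP code) combination. The sketch below hard-codes the ten rows from the table; the variable names are assumptions made for illustration.

```python
from collections import Counter

# The initial dataset from the table above: (ID, Age, Gender, ZIP code, Disease).
ROWS = [
    (1, 28, "Male", "12345", "Flu"),
    (2, 34, "Female", "12346", "Diabetes"),
    (3, 45, "Female", "12347", "Hypertension"),
    (4, 36, "Male", "12348", "Flu"),
    (5, 50, "Female", "12349", "Arthritis"),
    (6, 28, "Female", "12350", "Flu"),
    (7, 33, "Male", "12345", "Asthma"),
    (8, 45, "Male", "12346", "Cancer"),
    (9, 50, "Male", "12347", "Hypertension"),
    (10, 36, "Female", "12348", "Arthritis"),
]

# Group records by their (Age, Gender, ZIP code) combination.
class_sizes = Counter(
    (age, gender, zip_code) for _, age, gender, zip_code, _ in ROWS
)

# A record is uniquely identifiable if no other record shares its combination.
unique_ids = [
    row_id
    for row_id, age, gender, zip_code, _ in ROWS
    if class_sizes[(age, gender, zip_code)] == 1
]

print(f"smallest group size: {min(class_sizes.values())}")  # 1
print(f"uniquely identifiable records: {len(unique_ids)} of {len(ROWS)}")  # 10 of 10
```

Every combination occurs exactly once, so anyone who knows a person's age, gender, and ZIP code can look up that person's disease directly.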

Generalized Dataset for 3-Anonymity

To achieve 3-anonymity, we need to generalize the quasi-identifiers (Age, Gender, ZIP code) so that each combination of these identifiers appears in at least three records.

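Age	Gender	ZIP code	Disease
20-39	Male	123**	Flu
20-39	Female	123**	Diabetes
40-59	*	123**	Hypertension
20-39	Male	123**	Flu
40-59	*	123**	Arthritis
20-39	Female	123**	Flu
20-39	Male	123**	Asthma
40-59	*	123**	Cancer
40-59	*	123**	Hypertension
20-39	Female	123**	Arthritis

In this generalized dataset:

Ages are grouped into two non-overlapping ranges, 20-39 and 40-59.
Gender is suppressed (shown as *) in the 40-59 range.
ZIP codes are masked after the first three digits to represent a broader area.
The records now fall into three groups: (20-39, Male, 123**) with three records, (20-39, Female, 123**) with three records, and (40-59, *, 123**) with four records. Every combination of quasi-identifiers appears in at least three records, providing 3-anonymity.

Explanation:

Age: Ages are grouped into broader, non-overlapping categories (20-39 and 40-59) to make it harder to pinpoint an individual's exact age. Overlapping ranges such as 30-50 and 40-50 would make it ambiguous which group a 45-year-old belongs to.
Gender: Gender is kept in the 20-39 range, where each gender has three records. In the 40-59 range there are only two men and two women, so keeping gender would leave groups of two; suppressing it merges them into one group of four.
ZIP code: The last two digits of the ZIP code are masked with asterisks to generalize the location information. Masking only the last digit would not be enough, because ZIP code 12350 would be the only record in 1235*.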

Resulting Dataset for 3-Anonymity:

In this example, no record in the dataset can be singled out by its quasi-identifiers, because each combination of Age, Gender, and ZIP code appears in at least three records. An attacker who knows a person's age, gender, and ZIP code can narrow the search down to a group of at least three records but cannot tell which one belongs to that person, which provides a level of privacy protection.
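The generalization step and the final check can also be expressed in code. What follows is a minimal sketch of one possible generalization of the initial dataset, not a general-purpose anonymizer: the 20-year age bands, the three-digit ZIP prefix, the gender suppression rule for the 40-59 band, and the function name generalize are all assumptions chosen for illustration.

```python
from collections import Counter

# Quasi-identifiers and disease from the initial dataset: (Age, Gender, ZIP code, Disease).
ROWS = [
    (28, "Male", "12345", "Flu"),
    (34, "Female", "12346", "Diabetes"),
    (45, "Female", "12347", "Hypertension"),
    (36, "Male", "12348", "Flu"),
    (50, "Female", "12349", "Arthritis"),
    (28, "Female", "12350", "Flu"),
    (33, "Male", "12345", "Asthma"),
    (45, "Male", "12346", "Cancer"),
    (50, "Male", "12347", "Hypertension"),
    (36, "Female", "12348", "Arthritis"),
]


def generalize(age, gender, zip_code):
    """Map raw quasi-identifiers to generalized values (illustrative scheme)."""
    # Assumption: 20-year age bands starting at 20 (20-39, 40-59, ...).
    low = 20 + ((age - 20) // 20) * 20
    age_band = f"{low}-{low + 19}"
    # Assumption: suppress gender from age 40 up, where keeping it
    # would leave groups of only two records in this dataset.
    gen_gender = gender if low < 40 else "*"
    # Keep the first three ZIP digits and mask the rest.
    gen_zip = zip_code[:3] + "*" * (len(zip_code) - 3)
    return age_band, gen_gender, gen_zip


# Generalize every record; the sensitive attribute (disease) is left untouched.
generalized = [generalize(age, gender, z) + (disease,) for age, gender, z, disease in ROWS]

# Count group sizes over the generalized quasi-identifiers only.
class_sizes = Counter(row[:3] for row in generalized)
for combination, size in sorted(class_sizes.items()):
    print(combination, size)

k = min(class_sizes.values())
print(f"the generalized dataset is {k}-anonymous")  # 3-anonymous
```

The rules here were picked by hand for these ten rows. On a different dataset the same rules could leave groups smaller than k, so the group-size check should always be rerun after generalizing.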