Data anonymization

Before research data can be published in open access, all direct personal identifiers, such as names, addresses, telephone numbers and ID numbers, must be removed. The data must also be assessed for indirect personal identifiers, since such information, either alone or in combination with other variables, can expose the identities of individuals. Examples of indirect identifiers are unusual or specialized job titles, rare religious beliefs and membership in small groups or associations. The greater the number of indirect identifiers in the data, the greater the risk of disclosure and the greater the need for action (e.g., deleting certain variables, or erasing or encrypting information).

Please keep in mind that removing many variables or important information from a dataset may significantly reduce the usefulness of the data. In some cases, a better option may be to publish the data in restricted access, where stricter user requirements apply, instead of open access.

 

Assessment of disclosure risk

Assessing the risk of personal disclosure typically involves a systematic examination of all variables. This is especially important when the data cover sensitive topics, such as health status or illegal activity. One common criterion is that if a response category is endorsed by fewer than 20 individuals, the variable should be examined in more detail. Disclosure risk analysis can involve running frequency analyses of variables to identify low-frequency responses and extreme outliers. This should always be complemented with a qualitative analysis of risk characteristics based on local knowledge of the field work, its design, and the population and individuals studied.
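The frequency check described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete risk assessment: the function name `low_frequency_categories`, the variable `marital_status` and the response counts are all hypothetical, and the threshold of 20 is simply the rule of thumb mentioned in the text.

```python
from collections import Counter

THRESHOLD = 20  # rule of thumb from the text: fewer than 20 individuals


def low_frequency_categories(values, threshold=THRESHOLD):
    """Return {category: count} for categories chosen by fewer than `threshold` people."""
    counts = Counter(values)
    return {category: n for category, n in counts.items() if n < threshold}


# Hypothetical responses to a marital-status question.
marital_status = (
    ["married"] * 140 + ["single"] * 95 + ["divorced"] * 22 + ["widowed"] * 3
)

flagged = low_frequency_categories(marital_status)
print(flagged)  # {'widowed': 3}
```

A flagged category is only a prompt to look more closely at the variable; whether it actually poses a risk still depends on knowledge of the study and its population.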

 

Example: A study of workers' attitudes in which only one woman is employed at the workplace.

Example: Age and marital status are not obvious personal identifiers, but what if one participant is 18 years of age and divorced?
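The second example shows that variables can be harmless one by one and still identifying in combination. One possible way to look for this is to count how often each combination of indirect identifiers occurs. The sketch below assumes records stored as dictionaries; the function name `rare_combinations`, the variable names and the data are hypothetical.

```python
from collections import Counter


def rare_combinations(records, keys, threshold=20):
    """Count each combination of the given variables; return those below `threshold`."""
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    return {combo: n for combo, n in counts.items() if n < threshold}


# Hypothetical records; only the two indirect identifiers are shown.
records = (
    [{"age_group": "15-19", "marital_status": "single"}] * 40
    + [{"age_group": "40-44", "marital_status": "divorced"}] * 25
    + [{"age_group": "15-19", "marital_status": "divorced"}] * 1
)

risky = rare_combinations(records, keys=("age_group", "marital_status"))
print(risky)  # {('15-19', 'divorced'): 1}
```

In this made-up data every single category has at least 20 members, yet one combination is unique, which is exactly the situation the example describes.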

 

A number of software tools and scripts are available for analyzing the risk of personal disclosure in scientific data. However, it is important to note that one should never rely solely on such programs; expertise in the field of study is always needed.

 

Free-text information (string variables)

You also need to carefully check participants' free-text answers, for example in string variables. These may include information that does not seem harmful at first glance but can be used to trace answers back to individuals, directly or indirectly.

Example: An individual says he has been on the town council for the last few years.

Example: An unusually high number of square meters of a private house in a sparsely populated town.
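A simple script can help to pre-screen free-text answers before they are read manually. The sketch below is only an aid: the patterns in `PATTERNS` are illustrative assumptions, not a recommended list, and an indirect clue such as an unusually large house would not be caught by them. Every answer still needs to be read by someone who knows the study.

```python
import re

# Illustrative patterns only (assumptions); each project needs its own list.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "long_number": re.compile(r"\d{6,}"),
    "role_keyword": re.compile(
        r"\b(town council|mayor|chairman|director)\b", re.IGNORECASE
    ),
}


def flag_free_text(answers):
    """Return (index, answer, matched pattern names) for answers to review first."""
    flagged = []
    for index, text in enumerate(answers):
        hits = [name for name, pattern in PATTERNS.items() if pattern.search(text)]
        if hits:
            flagged.append((index, text, hits))
    return flagged


# Hypothetical free-text answers.
answers = [
    "I am happy with the service.",
    "I have been on the town council for the last few years.",
    "You can reach me at jon@example.org.",
]

for index, text, hits in flag_free_text(answers):
    print(index, hits)
```

Answers that are not flagged are not thereby safe; the script only helps to prioritize the manual check.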

 

Measures to reduce disclosure risk in scientific data

As mentioned earlier, direct personal identifiers must always be removed before data can be published, whether the data will be made available using an open or restricted access protocol. In addition, the data must be analyzed thoroughly with regard to indirect identifiers, which can also lead to the exposure of individuals. The table below provides an overview of methods that can be used to reduce the risk of disclosure due to direct and indirect personal identifiers in research data.

Measures to reduce disclosure risk in scientific data
Direct personal identifiers Example
Remove variables that contain direct identifiers.

Variables that contain ID numbers, names, e-mail addresses, telephone numbers, third-party IDs, information about workplace/job title, vehicle registration, IP addresses, student IDs, etc.
Free-text answers Example
Conduct a thorough check of all answers to open-ended questions (e.g., text in string variables) -> change (encrypt) or delete answers that contain traceable information -> If many answers are of a similar nature, a categorical (nominal) variable may be created to represent similar answers in a less detailed way.

Participant answer: "My mother is from Libya but moved to Iceland six years ago" -> Remove or encrypt the answer.

If many people answer in a similar way, the free-text variable can be replaced with a nominal variable that covers all answers in a broader way (e.g., "A close relative moved to Iceland a few years ago"). 

Demographic variables Example
Age -> should always be categorized into age groups. If an age group contains few individuals (e.g., fewer than 20 persons), a wider age range should be used. For example, place all individuals aged 75 and older in the same category (i.e., "75 years and older").

15-19 years
20-25 years
26-30 years
etc.

Labor market status -> should be categorized so that there are at least 20 individuals in each group.

Full time
Part-time
Retired
Other (e.g., unemployed, volunteer, disability)

Field of education -> categorize so that there are at least 20 individuals in each group (e.g., ISCED-F categorization).

Engineering
Food industry
Health care

Level of education (ISCED category) -> use only broad categories (maximum two digits), not detailed subgroup categories.

Primary school exam
Student exam
Technical or vocational studies
etc.

Number of years of study -> categorize so that there are at least 20 individuals in each group.

0-4 years 
5-8 years
etc.

Income -> categorize into broad categories.

Below ISK 400.000 a month
Between ISK 400.000-600.000 a month
etc.

Number of persons in a household -> categorize so that there are at least 20 persons in each group.

1 person
2 persons
3 persons
4 persons
5 or more persons

Native language -> categorize so that there are at least 20 individuals in each group.

Icelandic
Polish
Vietnamese
etc.

Health information -> categorize so that there are at least 20 individuals in each group.

Depression:
Yes
No

Country of birth -> categorize according to the United Nations' Standard country or area codes for statistical use (UN M49); use broader categories if fewer than 20 individuals are from a particular country/region.

Eastern Africa
Middle Africa
Southern Africa
etc.

Job -> categorize according to the International Standard Classification of Occupations (ISCO); use broader categories if a group has fewer than 20 individuals and/or the workplace is highly specialized (see also ÍSTARF21, Statistics Iceland).

Sales jobs at checkouts in stores and supermarkets
Teaching at the upper secondary level
Specialized jobs in the fishing industry
etc.
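Two of the measures in the table, grouping age with a top category and merging small groups, can be sketched as follows. This is one possible implementation under stated assumptions: the function names `age_group` and `collapse_small_categories` are hypothetical, the age bands are assumed to have equal width (so they differ slightly from the bands listed in the table), and the data are made up.

```python
from collections import Counter


def age_group(age, width=5, top=75):
    """Map an exact age to a band such as '20-24 years', top-coded at `top`."""
    if age >= top:
        return f"{top} years and older"
    low = (age // width) * width
    return f"{low}-{low + width - 1} years"


def collapse_small_categories(values, threshold=20, other="Other"):
    """Replace categories with fewer than `threshold` members by one `other` label."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else other for v in values]


# Hypothetical data.
ages = [18, 23, 76, 91]
print([age_group(a) for a in ages])
# ['15-19 years', '20-24 years', '75 years and older', '75 years and older']

languages = ["Icelandic"] * 300 + ["Polish"] * 45 + ["Vietnamese"] * 6 + ["Thai"] * 4
collapsed = Counter(collapse_small_categories(languages))
print(collapsed["Icelandic"], collapsed["Polish"], collapsed["Other"])  # 300 45 10
```

Note that the merged "Other" group in this example still has only 10 members, so the result must be checked again after merging; a broader regrouping or restricted access may be needed.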