Data cleaning

Data files should be prepared so that others can easily understand and analyze the data. All variables should be named and labeled in a clear and consistent manner, and unnecessary variables should be removed. Version control is recommended while cleaning data, so that all changes are documented and earlier versions of the data file can be restored if needed.

Checklist for cleaning of data files

  • Have all direct personal identifiers been removed from the data file (names of individuals, social security numbers, e-mail addresses, etc.)?
  • Does the data include any indirect identifiers, and if so, how high is the resulting risk of disclosing personal information?
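
As a rough illustration of the two questions above, the following Python sketch drops direct identifiers and then counts how many participants share each combination of indirect identifiers. The column names (name, ssn, email, birth_year, postcode) and the sample rows are hypothetical; a smallest group of size 1 means that at least one participant is unique on those variables and may be identifiable.

```python
from collections import Counter

# Hypothetical column names, used for illustration only.
DIRECT_IDENTIFIERS = {"name", "ssn", "email"}
QUASI_IDENTIFIERS = ("birth_year", "postcode")


def drop_direct_identifiers(rows):
    """Return copies of the rows without the direct identifier columns."""
    return [
        {key: value for key, value in row.items() if key not in DIRECT_IDENTIFIERS}
        for row in rows
    ]


def smallest_group_size(rows, quasi_identifiers=QUASI_IDENTIFIERS):
    """Size of the smallest group sharing one combination of indirect identifiers."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())


# Made-up example data.
rows = [
    {"name": "A", "email": "a@example.org", "birth_year": 1980, "postcode": "111", "q1": 3},
    {"name": "B", "email": "b@example.org", "birth_year": 1980, "postcode": "111", "q1": 2},
    {"name": "C", "email": "c@example.org", "birth_year": 1975, "postcode": "222", "q1": 4},
]
cleaned = drop_direct_identifiers(rows)
k = smallest_group_size(cleaned)
print(k)
```

What counts as an acceptable group size depends on the data and on the applicable rules, so the result is a starting point for a risk assessment rather than a verdict.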

Unnecessary variables

Have all unnecessary or inappropriate variables been removed from the version of the data file that is to be published?

Clear and descriptive variable names

  • Have all variables been given clear and descriptive names (rather than only generic names such as Q1, Q2, etc.)?
  • Do all variables have clear and descriptive labels (e.g., the question asked, or a short description of the variable's content)?
  • Have appropriate value labels been specified for each variable (e.g., 1 = Never, 2 = Sometimes, 3 = Often, 4 = Always, 99 = Missing)?
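
One way to check labels and values systematically is to keep a small codebook next to the data and compare the two. The sketch below is only an illustration: the variable name exercise_freq, its label, and the structure of the codebook are assumptions, not a prescribed format.

```python
# Hypothetical codebook: variable name -> variable label and value labels.
CODEBOOK = {
    "exercise_freq": {
        "label": "How often do you exercise?",
        "values": {1: "Never", 2: "Sometimes", 3: "Often", 4: "Always", 99: "Missing"},
    },
}


def undocumented_values(rows, codebook):
    """Map each variable to the set of observed values that lack a value label."""
    problems = {}
    for row in rows:
        for variable, value in row.items():
            known = codebook.get(variable, {}).get("values", {})
            if value not in known:
                problems.setdefault(variable, set()).add(value)
    return problems


# Made-up example data: the value 5 has no label in the codebook.
rows = [{"exercise_freq": 1}, {"exercise_freq": 5}, {"exercise_freq": 99}]
problems = undocumented_values(rows, CODEBOOK)
print(problems)
```

Variables that are missing from the codebook altogether are reported with all of their values, which also helps to spot unlabeled variables.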

Spelling and typing errors

Are there any spelling or typing errors, for example in the variable labels or value labels? What about free-text variables (string variables)?
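
For free-text variables with a limited set of expected answers, rare spellings that closely resemble a more frequent one are worth a manual look. The following sketch uses Python's standard difflib module; the similarity cutoff of 0.85 and the example values are assumptions chosen for illustration.

```python
from collections import Counter
from difflib import get_close_matches


def likely_typos(values, cutoff=0.85):
    """Pair each spelling with a more frequent, very similar spelling, if one exists."""
    counts = Counter(values)
    suggestions = {}
    for value, n in counts.items():
        # Only compare against spellings that occur more often than this one.
        candidates = [other for other, m in counts.items() if m > n]
        match = get_close_matches(value, candidates, n=1, cutoff=cutoff)
        if match:
            suggestions[value] = match[0]
    return suggestions


# Made-up example data.
cities = ["Stockholm", "Stockholm", "Stokholm", "Gothenburg", "Gothenburg", "Stockholm"]
suggestions = likely_typos(cities)
print(suggestions)
```

The output is a list of candidates for review, not an automatic correction: two similar strings can both be valid answers.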

Missing values 

Have missing values been coded in an appropriate manner? 
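
A minimal sketch of one possible approach: recode the numeric or empty placeholder codes to a single explicit missing marker and count how many values are missing per variable. The codes 99, -1 and the empty string are assumed here; the actual codes should come from the documentation of the dataset.

```python
# Assumed placeholder codes for missing values; adjust to the dataset's documentation.
MISSING_CODES = {99, -1, ""}


def recode_missing(rows, missing_codes=MISSING_CODES):
    """Return copies of the rows with placeholder codes replaced by None."""
    return [
        {key: (None if value in missing_codes else value) for key, value in row.items()}
        for row in rows
    ]


def missing_counts(rows):
    """Count the number of None values per variable."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (value is None)
    return counts


# Made-up example data.
rows = [{"q1": 3, "q2": 99}, {"q1": 99, "q2": ""}, {"q1": 2, "q2": 1}]
recoded = recode_missing(rows)
counts = missing_counts(recoded)
print(counts)
```

A code such as 99 can be a valid value for some variables (age, for instance), so in practice the missing codes are better defined per variable than for the whole file.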

Order of variables

Are the variables in the data file presented in a logical order?

Sometimes it may be useful to group variables together based on their content or focus, especially when a dataset contains many variables.  

Credibility 

  • Does the data contain any unusually high or low values that seem implausible (e.g., an individual's salary figure with one zero too many)?
  • Is there any repetition in the data that doesn't make sense (e.g., some participants entered twice)?
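
Both checks can be partly automated. The sketch below flags values that are many times larger or smaller than the median, and lists participant identifiers that occur more than once. The factor of 5, the column name participant_id and the sample values are assumptions for illustration, and the median rule only makes sense for positive quantities such as salaries.

```python
from collections import Counter
from statistics import median


def implausible_values(values, factor=5):
    """Flag values more than `factor` times above or below the median."""
    mid = median(values)
    return [v for v in values if v > mid * factor or v < mid / factor]


def duplicated_ids(rows, id_column="participant_id"):
    """Return the identifiers that appear in more than one row, sorted."""
    counts = Counter(row[id_column] for row in rows)
    return sorted(pid for pid, n in counts.items() if n > 1)


# Made-up example data: the last salary has one zero too many.
salaries = [30000, 32000, 35000, 31000, 330000]
suspicious = implausible_values(salaries)

rows = [{"participant_id": 1}, {"participant_id": 2}, {"participant_id": 2}]
repeated = duplicated_ids(rows)
print(suspicious, repeated)
```

Flagged values still need a human decision: an extreme value can be correct, and a repeated identifier can be legitimate in longitudinal data.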

Weighting of data

  • Is the data weighted (is there a "weight variable" in the dataset)?
  • Does the weight variable have a descriptive label (e.g., explaining on what basis the data was weighted)?
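
To illustrate why the weight variable needs to be documented, the following sketch checks that all weights are positive and compares a weighted mean with the unweighted one. The values and weights are made up.

```python
def weights_are_valid(weights):
    """Return True if every weight is a positive number."""
    return all(isinstance(w, (int, float)) and w > 0 for w in weights)


def weighted_mean(values, weights):
    """Mean of the values, with each value counted in proportion to its weight."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


# Made-up example data.
values = [1, 2, 3, 4]
weights = [0.5, 1.0, 1.5, 1.0]

valid = weights_are_valid(weights)
unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
print(valid, unweighted, weighted)
```

In this example the unweighted mean is 2.5 and the weighted mean is 2.75, so a user who is unaware of the weight variable would report a different result.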