Data cleaning
Data files should be prepared so that others can easily understand and analyze the data. All variables should be named and labeled clearly and consistently, and unnecessary variables should be removed. Version control is recommended while cleaning the data, so that every change is documented and earlier versions of the data file can be recovered if needed.
Checklist for cleaning of data files
Unnecessary variables: Have all unnecessary or inappropriate variables been removed from the version of the data file that is to be published?
Clear and descriptive variable names
Spelling and typing errors: Are there any spelling or typing errors, for example in the variable labels or values? What about in free-text (string) variables?
Missing values: Have missing values been coded in an appropriate manner?
Order of variables: Are the variables in the data file presented in a logical order? It can be useful to group variables by content or focus, especially when a dataset contains many variables.
Credibility
Weighting of data
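Several of the checklist items above (removing unnecessary variables, using clear names, coding missing values, and ordering variables) can be applied in one scripted pass, which also makes the cleaning steps easy to keep under version control. The following is a minimal sketch, not a prescribed procedure. The variable names, the -99 missing-value code, and the chosen column order are all hypothetical, and recoding missing values to None is only one option; the appropriate representation depends on the file format and the requirements of the repository.

```python
# Hypothetical example data: the variable names, the -99 missing code,
# and the column order below are assumptions made for illustration.
rows = [
    {"resp_id": 1, "q1_age": 34, "q2_income": -99, "interviewer_note": "called twice"},
    {"resp_id": 2, "q1_age": -99, "q2_income": 41000, "interviewer_note": ""},
]

# Variables that should not appear in the published file.
DROP = {"interviewer_note"}
# Cryptic names mapped to clear, descriptive ones.
RENAME = {"q1_age": "age_years", "q2_income": "income_annual"}
# Codes used in the raw file to mark missing values.
MISSING_CODES = {-99}
# Logical order of variables in the published file.
ORDER = ["resp_id", "age_years", "income_annual"]


def clean_row(row):
    """Drop, rename, recode missing values, and reorder one record."""
    kept = {RENAME.get(k, k): v for k, v in row.items() if k not in DROP}
    recoded = {k: (None if v in MISSING_CODES else v) for k, v in kept.items()}
    return {name: recoded[name] for name in ORDER}


cleaned = [clean_row(r) for r in rows]

for record in cleaned:
    print(record)
```

Keeping the rules in named constants (DROP, RENAME, MISSING_CODES, ORDER) means each cleaning decision is visible in one place and every change to it shows up as a small, reviewable difference between versions of the script.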