Data cleaning – broad concepts and principles

Today’s topic is data manipulation. Before we jump into the R functions, let’s first talk about what data are and what the data chain (workflow) looks like.

Data are a set of values of qualitative or quantitative variables collected through observations.

Raw data have not been “cleaned” to remove outliers, instrument/observation errors, or data entry errors. “Raw” is relative: data may be raw to you even though someone else pre-processed them before you received them.

Everyone on the data chain should:

  1. Keep a copy of the raw data
  2. Record all operations used to generate the clean data
  3. Document the contents of the clean data (e.g., meaning of variable names, issues, etc.)
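
The three habits above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the file names, the made-up height data, and the 300 cm threshold are all hypothetical, and a temporary directory stands in for a real project folder.

```r
# Hypothetical file names; a temporary directory stands in for a project folder
proj <- tempfile("project_")
dir.create(proj)
raw_path   <- file.path(proj, "measurements_raw.csv")
clean_path <- file.path(proj, "measurements_clean.csv")
dict_path  <- file.path(proj, "measurements_dictionary.csv")

# Made-up raw data, including an impossible height (a data entry error)
raw <- data.frame(
  id        = 1:4,
  height_cm = c(172, 165, 1800, 158)
)
write.csv(raw, raw_path, row.names = FALSE)

# 1. Keep a copy of the raw data: mark the file read-only and never edit it
Sys.chmod(raw_path, mode = "0444")

# 2. Record all operations: every cleaning step lives in this script
dat <- read.csv(raw_path)
dat$height_cm[dat$height_cm > 300] <- NA   # assumed threshold for entry errors
write.csv(dat, clean_path, row.names = FALSE)

# 3. Document the contents of the clean data in a small data dictionary
dict <- data.frame(
  variable    = c("id", "height_cm"),
  description = c("Participant identifier",
                  "Height in centimeters; values above 300 set to NA")
)
write.csv(dict, dict_path, row.names = FALSE)
```

Because the cleaning happens in a script rather than by hand, the clean file can be regenerated from the untouched raw file at any time.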

Spreadsheets are probably the most common way to enter and organize data, at least when the dataset is relatively small. If this applies to your own work, be sure to read these excellent slides on data organization with spreadsheets.

Here is a summary:

  1. Be consistent
  2. Write dates as YYYY-MM-DD
  3. Choose good names for things
  4. No empty cells
  5. One thing per cell
  6. Make it a rectangle
  7. Make a data dictionary
  8. No calculations in the data file
  9. No color/formatting as data
  10. Make backups
  11. Use data validation
  12. Save as plain text
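
A few of these rules (dates as YYYY-MM-DD, no empty cells, save as plain text) can also be applied or checked from R. The sketch below uses made-up records and hypothetical column names, and it assumes the original dates were typed as day/month/year.

```r
# Made-up field records with dates typed as day/month/year
records <- data.frame(
  site_id   = c("A1", "A2", "B1"),
  visit_day = c("03/02/2021", "15/02/2021", "01/03/2021"),
  count     = c(12, NA, 7),
  stringsAsFactors = FALSE
)

# Write dates as YYYY-MM-DD: parse with an explicit format, then print as ISO 8601
records$visit_day <- format(as.Date(records$visit_day, format = "%d/%m/%Y"),
                            "%Y-%m-%d")

# No empty cells: identifiers and dates must be present in every row
stopifnot(!anyNA(records$site_id), !anyNA(records$visit_day))

# Save as plain text; missing counts are written explicitly as NA, not left blank
out_path <- file.path(tempdir(), "records.csv")
write.csv(records, out_path, row.names = FALSE, na = "NA")
```

Passing an explicit `format` to `as.Date()` avoids guessing whether 03/02/2021 means 3 February or 2 March, which is exactly the ambiguity the YYYY-MM-DD rule is meant to remove.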

With that background, let’s talk about data cleaning, which generally takes most of a project’s time (a commonly quoted figure is around 80%). This is probably because data cleaning is hard to generalize across projects. There are, however, some common principles, and below is an excellent summary of them provided by Dr. Karl Broman.