Today’s topic is data manipulation. Before we jump into the R functions, let’s first talk about what data are and what the data chain (workflow) looks like.
Data are a set of values of qualitative or quantitative variables collected through observations.
Raw data have not been “cleaned” to remove outliers, instrument/observation errors, or data entry errors. Whether data are raw is relative: data may be raw to you even though someone else pre-processed them before you received them.
Everyone in the data chain should:
Spreadsheets are probably the most common way to enter and organize data, at least when the dataset is relatively small. If this applies to your own work, be sure to read this excellent set of slides on data organization with spreadsheets.
Summary here:
OK, with that background, let’s talk about data cleaning, which generally takes most of a project’s time (something like 80%). This is probably because data cleaning is hard to generalize across projects. Still, there are some common principles, and below is an excellent summary of data cleaning principles provided by Dr. Karl Broman.