Data cleaning – broad concepts and principles

Today’s topic is about data manipulation. Before we jump into the R functions, let’s talk about what are data and data chain/wrokflow first.

Data are a set of values of qualitative or quantitative variables collected through observations.

Raw data have not been “cleaned” to remove outliers, instrument/observation errors, or data entry errors. Raw data can be relative: data may be raw to you, but they may have been pre-processed by someone prior to you receiving them.

Every one on the data chain should:

  1. Keep a copy of the raw data
  2. Record all operations used to generate the clean data
  3. Document the contents of the clean data (e.g., meaning of variable names, issues, etc.)

Spreadsheet is probably the most common way to enter and organize data for most cases (when data size is relatively small). If this applies to your own work, be sure to read this excellent data organization with spreadsheet slides.

Summary here:

  1. Be consistent
  2. Write dates as YYYY-MM-DD
  3. Choose good names for things
  4. No empty cells
  5. One thing per cell
  6. Make it a rectangle
  7. Make a data dictionary
  8. No calculations in the data file
  9. No color/formatting as data
  10. Make backups
  11. Use data validation
  12. Save as plain text

Ok, with the above information, let’s talk about data cleaning, which generally takes the most of project time (like 80%). This is probably because that data cleaning is hard to be generalized to every project. But there are some common principles. And below is an excellent summary of data cleaning principles provided by Dr. Karl Broman.

data clean principles
data clean principles

Please do make sure to read his notes along with the slides.

Data cleaning – useful functions that are regularly used

Today, we will mainly introduce some functions from the tidyverse that I regularly use for daily work. Again, I will just focus on a very small proportion of them. To learn more, read the corresponding R packages’ documentations!

First, let’s get some datasets.

install.packages('neonDivData', repos = 'https://daijiang.r-universe.dev')

This data package has organismal data collected by NEON. You can get all the raw data from its data portal and R package neonUtilities.

library(neonDivData)
data_bird = neonDivData::data_bird[, c("siteID", "plotID", "pointID", "observation_datetime",
              "taxon_id", "taxon_name", "taxon_rank", "value", "unit", "variable_name",
              "latitude", "longitude", "elevation")]
DT::datatable(head(data_bird, n = 100))

Wow, there are lots of variables! And believe it or not, this is actually already a “cleaned” version of the data. Normally, we’d check the data structure and look at what values does each variable contain. For example:

str(neonDivData::data_bird)
## tibble [226,826 × 35] (S3: tbl_df/tbl/data.frame)
##  $ location_id               : chr [1:226826] "BART_025.birdGrid.brd" "BART_025.birdGrid.brd" "BART_025.birdGrid.brd" "BART_025.birdGrid.brd" ...
##  $ siteID                    : chr [1:226826] "BART" "BART" "BART" "BART" ...
##  $ plotID                    : chr [1:226826] "BART_025" "BART_025" "BART_025" "BART_025" ...
##  $ pointID                   : chr [1:226826] "C1" "C1" "C1" "C1" ...
##  $ unique_sample_id          : chr [1:226826] "BART_025.C1.2015-06-14" "BART_025.C1.2015-06-14" "BART_025.C1.2015-06-14" "BART_025.C1.2015-06-14" ...
##  $ observation_datetime      : POSIXct[1:226826], format: "2015-06-14 09:23:00" "2015-06-14 09:23:00" ...
##  $ taxon_id                  : chr [1:226826] "BCCH" "REVI" "BAWW" "BTNW" ...
##  $ taxon_name                : chr [1:226826] "Poecile atricapillus" "Vireo olivaceus" "Mniotilta varia" "Setophaga virens" ...
##  $ taxon_rank                : chr [1:226826] "species" "species" "species" "species" ...
##  $ variable_name             : chr [1:226826] "cluster size" "cluster size" "cluster size" "cluster size" ...
##  $ value                     : num [1:226826] 1 1 1 1 1 1 1 1 1 1 ...
##  $ unit                      : chr [1:226826] "count of individuals" "count of individuals" "count of individuals" "count of individuals" ...
##  $ pointCountMinute          : chr [1:226826] "1" "1" "2" "2" ...
##  $ targetTaxaPresent         : chr [1:226826] "Y" "Y" "Y" "Y" ...
##  $ nativeStatusCode          : chr [1:226826] "N" "N" "N" "N" ...
##  $ observerDistance          : chr [1:226826] "42" "9" "17" "50" ...
##  $ detectionMethod           : chr [1:226826] "singing" "singing" "singing" "singing" ...
##  $ visualConfirmation        : chr [1:226826] "No" "No" "No" "No" ...
##  $ sexOrAge                  : chr [1:226826] "Male" "Male" "Male" "Male" ...
##  $ release                   : chr [1:226826] "RELEASE-2022" "RELEASE-2022" "RELEASE-2022" "RELEASE-2022" ...
##  $ startCloudCoverPercentage : chr [1:226826] "20" "20" "20" "20" ...
##  $ endCloudCoverPercentage   : chr [1:226826] "40" "40" "40" "40" ...
##  $ startRH                   : chr [1:226826] "72" "72" "72" "72" ...
##  $ endRH                     : chr [1:226826] "56" "56" "56" "56" ...
##  $ observedHabitat           : chr [1:226826] "evergreen forest" "evergreen forest" "evergreen forest" "evergreen forest" ...
##  $ observedAirTemp           : chr [1:226826] "18" "18" "18" "18" ...
##  $ kmPerHourObservedWindSpeed: chr [1:226826] "1" "1" "1" "1" ...
##  $ samplingProtocolVersion   : chr [1:226826] "NEON.DOC.014041vG" "NEON.DOC.014041vG" "NEON.DOC.014041vG" "NEON.DOC.014041vG" ...
##  $ remarks                   : chr [1:226826] NA NA NA NA ...
##  $ clusterCode               : chr [1:226826] NA NA NA NA ...
##  $ latitude                  : num [1:226826] 44.1 44.1 44.1 44.1 44.1 ...
##  $ longitude                 : num [1:226826] -71.3 -71.3 -71.3 -71.3 -71.3 ...
##  $ elevation                 : num [1:226826] 576 576 576 576 576 ...
##  $ nlcdClass                 : chr [1:226826] "evergreenForest" "evergreenForest" "evergreenForest" "evergreenForest" ...
##  $ plotType                  : chr [1:226826] "distributed" "distributed" "distributed" "distributed" ...
table(neonDivData::data_bird$taxon_rank)
## 
##        class       family        genus      species speciesGroup    subfamily 
##         1791         1835          261       222552          139          114 
##   subspecies 
##          134

We will skip these steps and leave it to the “Exploratory data analysis” lecture.

Instead, I will introduce some functions by tasks.

All verbs in tidyverse work similarly:

  1. The first argument is a data frame.
  2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
  3. The result is a new data frame.

How to select some columns/variables? select()

This dataset is a data frame. So, we can always use the data frame subsetting codes we learned to select specific columns.

data_bird[, c("siteID", "plotID", "pointID", "observation_datetime",
              "taxon_id", "taxon_name", "taxon_rank", 
              "latitude", "longitude", "elevation")]
## # A tibble: 226,826 × 10
##    siteID plotID   pointID observation_datetime taxon_id taxon_name   taxon_rank
##    <chr>  <chr>    <chr>   <dttm>               <chr>    <chr>        <chr>     
##  1 BART   BART_025 C1      2015-06-14 09:23:00  BCCH     Poecile atr… species   
##  2 BART   BART_025 C1      2015-06-14 09:23:00  REVI     Vireo oliva… species   
##  3 BART   BART_025 C1      2015-06-14 09:23:00  BAWW     Mniotilta v… species   
##  4 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  5 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  6 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  7 BART   BART_025 B1      2015-06-14 09:43:00  BAWW     Mniotilta v… species   
##  8 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  9 BART   BART_025 B1      2015-06-14 09:43:00  BTNW     Setophaga v… species   
## 10 BART   BART_025 A1      2015-06-14 10:31:00  REVI     Vireo oliva… species   
## # ℹ 226,816 more rows
## # ℹ 3 more variables: latitude <dbl>, longitude <dbl>, elevation <dbl>

More conveniently, we can use the select() functions from dplyr.

library(dplyr, warn.conflicts = FALSE)
select(data_bird, siteID, plotID, pointID, date_time = observation_datetime,
       starts_with("taxon"),
       latitude, longitude, elevation)
## # A tibble: 226,826 × 10
##    siteID plotID   pointID date_time           taxon_id taxon_name    taxon_rank
##    <chr>  <chr>    <chr>   <dttm>              <chr>    <chr>         <chr>     
##  1 BART   BART_025 C1      2015-06-14 09:23:00 BCCH     Poecile atri… species   
##  2 BART   BART_025 C1      2015-06-14 09:23:00 REVI     Vireo olivac… species   
##  3 BART   BART_025 C1      2015-06-14 09:23:00 BAWW     Mniotilta va… species   
##  4 BART   BART_025 C1      2015-06-14 09:23:00 BTNW     Setophaga vi… species   
##  5 BART   BART_025 C1      2015-06-14 09:23:00 BTNW     Setophaga vi… species   
##  6 BART   BART_025 B1      2015-06-14 09:43:00 WIWR     Troglodytes … species   
##  7 BART   BART_025 B1      2015-06-14 09:43:00 BAWW     Mniotilta va… species   
##  8 BART   BART_025 B1      2015-06-14 09:43:00 WIWR     Troglodytes … species   
##  9 BART   BART_025 B1      2015-06-14 09:43:00 BTNW     Setophaga vi… species   
## 10 BART   BART_025 A1      2015-06-14 10:31:00 REVI     Vireo olivac… species   
## # ℹ 226,816 more rows
## # ℹ 3 more variables: latitude <dbl>, longitude <dbl>, elevation <dbl>

select() can be combined with other functions such as ends_with(), everything(), matches(), etc.

select(data_bird, ends_with("name"))
## # A tibble: 226,826 × 2
##    taxon_name           variable_name
##    <chr>                <chr>        
##  1 Poecile atricapillus cluster size 
##  2 Vireo olivaceus      cluster size 
##  3 Mniotilta varia      cluster size 
##  4 Setophaga virens     cluster size 
##  5 Setophaga virens     cluster size 
##  6 Troglodytes hiemalis cluster size 
##  7 Mniotilta varia      cluster size 
##  8 Troglodytes hiemalis cluster size 
##  9 Setophaga virens     cluster size 
## 10 Vireo olivaceus      cluster size 
## # ℹ 226,816 more rows
select(data_bird, where(is.numeric))
## # A tibble: 226,826 × 4
##    value latitude longitude elevation
##    <dbl>    <dbl>     <dbl>     <dbl>
##  1     1     44.1     -71.3      576.
##  2     1     44.1     -71.3      576.
##  3     1     44.1     -71.3      576.
##  4     1     44.1     -71.3      576.
##  5     1     44.1     -71.3      576.
##  6     1     44.1     -71.3      576.
##  7     1     44.1     -71.3      576.
##  8     1     44.1     -71.3      576.
##  9     1     44.1     -71.3      576.
## 10     1     44.1     -71.3      576.
## # ℹ 226,816 more rows
select(data_bird, latitude, longitude, everything())
## # A tibble: 226,826 × 13
##    latitude longitude siteID plotID   pointID observation_datetime taxon_id
##       <dbl>     <dbl> <chr>  <chr>    <chr>   <dttm>               <chr>   
##  1     44.1     -71.3 BART   BART_025 C1      2015-06-14 09:23:00  BCCH    
##  2     44.1     -71.3 BART   BART_025 C1      2015-06-14 09:23:00  REVI    
##  3     44.1     -71.3 BART   BART_025 C1      2015-06-14 09:23:00  BAWW    
##  4     44.1     -71.3 BART   BART_025 C1      2015-06-14 09:23:00  BTNW    
##  5     44.1     -71.3 BART   BART_025 C1      2015-06-14 09:23:00  BTNW    
##  6     44.1     -71.3 BART   BART_025 B1      2015-06-14 09:43:00  WIWR    
##  7     44.1     -71.3 BART   BART_025 B1      2015-06-14 09:43:00  BAWW    
##  8     44.1     -71.3 BART   BART_025 B1      2015-06-14 09:43:00  WIWR    
##  9     44.1     -71.3 BART   BART_025 B1      2015-06-14 09:43:00  BTNW    
## 10     44.1     -71.3 BART   BART_025 A1      2015-06-14 10:31:00  REVI    
## # ℹ 226,816 more rows
## # ℹ 6 more variables: taxon_name <chr>, taxon_rank <chr>, value <dbl>,
## #   unit <chr>, variable_name <chr>, elevation <dbl>
select(data_bird, c(1, 3, 5)) # by location
## # A tibble: 226,826 × 3
##    siteID pointID taxon_id
##    <chr>  <chr>   <chr>   
##  1 BART   C1      BCCH    
##  2 BART   C1      REVI    
##  3 BART   C1      BAWW    
##  4 BART   C1      BTNW    
##  5 BART   C1      BTNW    
##  6 BART   B1      WIWR    
##  7 BART   B1      BAWW    
##  8 BART   B1      WIWR    
##  9 BART   B1      BTNW    
## 10 BART   A1      REVI    
## # ℹ 226,816 more rows

Note that sometimes the select function will be covered by the {raster::select} function. If this is the case, try to be explicit and use dplyr::select().

How to keep/remove some specific rows? filter()

Again, base R functions to subset data frames work here. But we will focus on the dplyr::filter() function. Note that sometimes the filter function will be covered by the {stats::filter} function. If you are not sure, just call the function filter [without ()] in R and it will let you know which package it is from.

filter(data_bird, plotID == "BART_025")
## # A tibble: 371 × 13
##    siteID plotID   pointID observation_datetime taxon_id taxon_name   taxon_rank
##    <chr>  <chr>    <chr>   <dttm>               <chr>    <chr>        <chr>     
##  1 BART   BART_025 C1      2015-06-14 09:23:00  BCCH     Poecile atr… species   
##  2 BART   BART_025 C1      2015-06-14 09:23:00  REVI     Vireo oliva… species   
##  3 BART   BART_025 C1      2015-06-14 09:23:00  BAWW     Mniotilta v… species   
##  4 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  5 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  6 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  7 BART   BART_025 B1      2015-06-14 09:43:00  BAWW     Mniotilta v… species   
##  8 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  9 BART   BART_025 B1      2015-06-14 09:43:00  BTNW     Setophaga v… species   
## 10 BART   BART_025 A1      2015-06-14 10:31:00  REVI     Vireo oliva… species   
## # ℹ 361 more rows
## # ℹ 6 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## #   latitude <dbl>, longitude <dbl>, elevation <dbl>
filter(data_bird, plotID == "BART_025", pointID == "C1") # a & b by default
## # A tibble: 63 × 13
##    siteID plotID   pointID observation_datetime taxon_id taxon_name   taxon_rank
##    <chr>  <chr>    <chr>   <dttm>               <chr>    <chr>        <chr>     
##  1 BART   BART_025 C1      2015-06-14 09:23:00  BCCH     Poecile atr… species   
##  2 BART   BART_025 C1      2015-06-14 09:23:00  REVI     Vireo oliva… species   
##  3 BART   BART_025 C1      2015-06-14 09:23:00  BAWW     Mniotilta v… species   
##  4 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  5 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  6 BART   BART_025 C1      2016-06-24 09:39:00  BTNW     Setophaga v… species   
##  7 BART   BART_025 C1      2016-06-24 09:39:00  RBNU     Sitta canad… species   
##  8 BART   BART_025 C1      2016-06-24 09:39:00  BHVI     Vireo solit… species   
##  9 BART   BART_025 C1      2016-06-24 09:39:00  BTBW     Setophaga c… species   
## 10 BART   BART_025 C1      2016-06-24 09:39:00  REVI     Vireo oliva… species   
## # ℹ 53 more rows
## # ℹ 6 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## #   latitude <dbl>, longitude <dbl>, elevation <dbl>
filter(data_bird, plotID == "BART_025" | pointID == "C1") # a or b 
## # A tibble: 21,163 × 13
##    siteID plotID   pointID observation_datetime taxon_id taxon_name   taxon_rank
##    <chr>  <chr>    <chr>   <dttm>               <chr>    <chr>        <chr>     
##  1 BART   BART_025 C1      2015-06-14 09:23:00  BCCH     Poecile atr… species   
##  2 BART   BART_025 C1      2015-06-14 09:23:00  REVI     Vireo oliva… species   
##  3 BART   BART_025 C1      2015-06-14 09:23:00  BAWW     Mniotilta v… species   
##  4 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  5 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  6 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  7 BART   BART_025 B1      2015-06-14 09:43:00  BAWW     Mniotilta v… species   
##  8 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  9 BART   BART_025 B1      2015-06-14 09:43:00  BTNW     Setophaga v… species   
## 10 BART   BART_025 A1      2015-06-14 10:31:00  REVI     Vireo oliva… species   
## # ℹ 21,153 more rows
## # ℹ 6 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## #   latitude <dbl>, longitude <dbl>, elevation <dbl>
filter(data_bird, latitude > 40)
## # A tibble: 86,241 × 13
##    siteID plotID   pointID observation_datetime taxon_id taxon_name   taxon_rank
##    <chr>  <chr>    <chr>   <dttm>               <chr>    <chr>        <chr>     
##  1 BART   BART_025 C1      2015-06-14 09:23:00  BCCH     Poecile atr… species   
##  2 BART   BART_025 C1      2015-06-14 09:23:00  REVI     Vireo oliva… species   
##  3 BART   BART_025 C1      2015-06-14 09:23:00  BAWW     Mniotilta v… species   
##  4 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  5 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  6 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  7 BART   BART_025 B1      2015-06-14 09:43:00  BAWW     Mniotilta v… species   
##  8 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  9 BART   BART_025 B1      2015-06-14 09:43:00  BTNW     Setophaga v… species   
## 10 BART   BART_025 A1      2015-06-14 10:31:00  REVI     Vireo oliva… species   
## # ℹ 86,231 more rows
## # ℹ 6 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## #   latitude <dbl>, longitude <dbl>, elevation <dbl>

How to add new columns or update existing columns? mutate()

With base R functions, we can do this with assignment.

data_bird$new_column = data_bird$value * 10

Here is the dplyr way:

# update existing column
mutate(data_bird, value = value + 1)
## # A tibble: 226,826 × 14
##    siteID plotID   pointID observation_datetime taxon_id taxon_name   taxon_rank
##    <chr>  <chr>    <chr>   <dttm>               <chr>    <chr>        <chr>     
##  1 BART   BART_025 C1      2015-06-14 09:23:00  BCCH     Poecile atr… species   
##  2 BART   BART_025 C1      2015-06-14 09:23:00  REVI     Vireo oliva… species   
##  3 BART   BART_025 C1      2015-06-14 09:23:00  BAWW     Mniotilta v… species   
##  4 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  5 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  6 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  7 BART   BART_025 B1      2015-06-14 09:43:00  BAWW     Mniotilta v… species   
##  8 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  9 BART   BART_025 B1      2015-06-14 09:43:00  BTNW     Setophaga v… species   
## 10 BART   BART_025 A1      2015-06-14 10:31:00  REVI     Vireo oliva… species   
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## #   latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>
# create new column(s)
mutate(data_bird, new_column = value * 10)
## # A tibble: 226,826 × 14
##    siteID plotID   pointID observation_datetime taxon_id taxon_name   taxon_rank
##    <chr>  <chr>    <chr>   <dttm>               <chr>    <chr>        <chr>     
##  1 BART   BART_025 C1      2015-06-14 09:23:00  BCCH     Poecile atr… species   
##  2 BART   BART_025 C1      2015-06-14 09:23:00  REVI     Vireo oliva… species   
##  3 BART   BART_025 C1      2015-06-14 09:23:00  BAWW     Mniotilta v… species   
##  4 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  5 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  6 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  7 BART   BART_025 B1      2015-06-14 09:43:00  BAWW     Mniotilta v… species   
##  8 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  9 BART   BART_025 B1      2015-06-14 09:43:00  BTNW     Setophaga v… species   
## 10 BART   BART_025 A1      2015-06-14 10:31:00  REVI     Vireo oliva… species   
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## #   latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>
# the created column can be used immediately
mutate(data_bird, 
       new_column = value * 10,
       newnew_col = new_column * 10
       )
## # A tibble: 226,826 × 15
##    siteID plotID   pointID observation_datetime taxon_id taxon_name   taxon_rank
##    <chr>  <chr>    <chr>   <dttm>               <chr>    <chr>        <chr>     
##  1 BART   BART_025 C1      2015-06-14 09:23:00  BCCH     Poecile atr… species   
##  2 BART   BART_025 C1      2015-06-14 09:23:00  REVI     Vireo oliva… species   
##  3 BART   BART_025 C1      2015-06-14 09:23:00  BAWW     Mniotilta v… species   
##  4 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  5 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  6 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  7 BART   BART_025 B1      2015-06-14 09:43:00  BAWW     Mniotilta v… species   
##  8 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  9 BART   BART_025 B1      2015-06-14 09:43:00  BTNW     Setophaga v… species   
## 10 BART   BART_025 A1      2015-06-14 10:31:00  REVI     Vireo oliva… species   
## # ℹ 226,816 more rows
## # ℹ 8 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## #   latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>,
## #   newnew_col <dbl>
mutate(data_bird, lat_band = ifelse(latitude > 45, "high", "low")) |> 
  pull(lat_band) |> 
  table()
## 
##   high    low 
##  46437 180389
# recode is very useful to correct values
mutate(data_bird, siteID = recode(siteID, "BART" = "BART2"))
## # A tibble: 226,826 × 14
##    siteID plotID   pointID observation_datetime taxon_id taxon_name   taxon_rank
##    <chr>  <chr>    <chr>   <dttm>               <chr>    <chr>        <chr>     
##  1 BART2  BART_025 C1      2015-06-14 09:23:00  BCCH     Poecile atr… species   
##  2 BART2  BART_025 C1      2015-06-14 09:23:00  REVI     Vireo oliva… species   
##  3 BART2  BART_025 C1      2015-06-14 09:23:00  BAWW     Mniotilta v… species   
##  4 BART2  BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  5 BART2  BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  6 BART2  BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  7 BART2  BART_025 B1      2015-06-14 09:43:00  BAWW     Mniotilta v… species   
##  8 BART2  BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  9 BART2  BART_025 B1      2015-06-14 09:43:00  BTNW     Setophaga v… species   
## 10 BART2  BART_025 A1      2015-06-14 10:31:00  REVI     Vireo oliva… species   
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## #   latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>

How to get summary values for each group? summarise()

Suppose we want to calculate the number of species observed (species richness) at each site, how should we get such values? We can combine summarise() with group_by().

data_bird |> 
  group_by(siteID)
## # A tibble: 226,826 × 14
## # Groups:   siteID [47]
##    siteID plotID   pointID observation_datetime taxon_id taxon_name   taxon_rank
##    <chr>  <chr>    <chr>   <dttm>               <chr>    <chr>        <chr>     
##  1 BART   BART_025 C1      2015-06-14 09:23:00  BCCH     Poecile atr… species   
##  2 BART   BART_025 C1      2015-06-14 09:23:00  REVI     Vireo oliva… species   
##  3 BART   BART_025 C1      2015-06-14 09:23:00  BAWW     Mniotilta v… species   
##  4 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  5 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  6 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  7 BART   BART_025 B1      2015-06-14 09:43:00  BAWW     Mniotilta v… species   
##  8 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  9 BART   BART_025 B1      2015-06-14 09:43:00  BTNW     Setophaga v… species   
## 10 BART   BART_025 A1      2015-06-14 10:31:00  REVI     Vireo oliva… species   
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## #   latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>
sp_rich_by_site <- data_bird |> 
  group_by(siteID) |> 
  summarise(sp_richness = n_distinct(taxon_id))
# same as length(unique(taxon_id))
sp_rich_by_site
## # A tibble: 47 × 2
##    siteID sp_richness
##    <chr>        <int>
##  1 ABBY            81
##  2 BARR            43
##  3 BART            80
##  4 BLAN            82
##  5 BONA            47
##  6 CLBJ           126
##  7 CPER            72
##  8 DCFS           120
##  9 DEJU            52
## 10 DELA            76
## # ℹ 37 more rows
# data can be grouped by multiple variables
sp_rich_by_site_point <- data_bird |> 
  group_by(siteID, pointID) |> 
  summarise(sp_richness = n_distinct(taxon_id),
            .groups = "drop")
sp_rich_by_site_point
## # A tibble: 303 × 3
##    siteID pointID sp_richness
##    <chr>  <chr>         <int>
##  1 ABBY   21               81
##  2 BARR   A1               27
##  3 BARR   A2               25
##  4 BARR   A3               26
##  5 BARR   B1               25
##  6 BARR   B2               29
##  7 BARR   B3               23
##  8 BARR   C1               23
##  9 BARR   C2               26
## 10 BARR   C3               32
## # ℹ 293 more rows
# summarise() can have multiple calculations
data_bird |> 
  group_by(siteID, pointID) |> 
  summarise(sp_richness = n_distinct(taxon_id),
            abundance = sum(value),
            n_obs = n(),
            .groups = "drop")
## # A tibble: 303 × 5
##    siteID pointID sp_richness abundance n_obs
##    <chr>  <chr>         <int>     <dbl> <int>
##  1 ABBY   21               81      1915  1839
##  2 BARR   A1               27       409   246
##  3 BARR   A2               25       357   223
##  4 BARR   A3               26       394   278
##  5 BARR   B1               25       497   253
##  6 BARR   B2               29       444   277
##  7 BARR   B3               23       324   252
##  8 BARR   C1               23       329   258
##  9 BARR   C2               26       336   265
## 10 BARR   C3               32       448   276
## # ℹ 293 more rows

Exercise

Install the following data package.

if(!require(palmerpenguins)) install.packages("palmerpenguins")
## Loading required package: palmerpenguins
palmerpenguins::penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
  1. For each species and island combination, find the average body mass and bill length.
  2. For the same species, do male penguins have larger body mass than females?

How to order the rows? arrange()

For the sp_rich_by_site_point data frame, how do we order the rows by sp_richness (high to low)?

arrange(sp_rich_by_site_point, desc(sp_richness))
## # A tibble: 303 × 3
##    siteID pointID sp_richness
##    <chr>  <chr>         <int>
##  1 DCFS   21              120
##  2 NOGP   21              106
##  3 SERC   21              100
##  4 WOOD   C1               93
##  5 WOOD   B1               92
##  6 WOOD   A3               90
##  7 SCBI   21               89
##  8 WOOD   A1               88
##  9 WOOD   A2               87
## 10 KONA   21               86
## # ℹ 293 more rows

How to remove rows (observations) with NA of latitude? drop_na()

Observations without latitude probably won’t be that useful, how should we remove them? We can do so with Base R:

mean(is.na(data_bird$latitude))
## [1] 0
data_bird[!is.na(data_bird$latitude), ]
## # A tibble: 226,826 × 14
##    siteID plotID   pointID observation_datetime taxon_id taxon_name   taxon_rank
##    <chr>  <chr>    <chr>   <dttm>               <chr>    <chr>        <chr>     
##  1 BART   BART_025 C1      2015-06-14 09:23:00  BCCH     Poecile atr… species   
##  2 BART   BART_025 C1      2015-06-14 09:23:00  REVI     Vireo oliva… species   
##  3 BART   BART_025 C1      2015-06-14 09:23:00  BAWW     Mniotilta v… species   
##  4 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  5 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  6 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  7 BART   BART_025 B1      2015-06-14 09:43:00  BAWW     Mniotilta v… species   
##  8 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  9 BART   BART_025 B1      2015-06-14 09:43:00  BTNW     Setophaga v… species   
## 10 BART   BART_025 A1      2015-06-14 10:31:00  REVI     Vireo oliva… species   
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## #   latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>

Or with filter():

filter(data_bird, !is.na(latitude))
## # A tibble: 226,826 × 14
##    siteID plotID   pointID observation_datetime taxon_id taxon_name   taxon_rank
##    <chr>  <chr>    <chr>   <dttm>               <chr>    <chr>        <chr>     
##  1 BART   BART_025 C1      2015-06-14 09:23:00  BCCH     Poecile atr… species   
##  2 BART   BART_025 C1      2015-06-14 09:23:00  REVI     Vireo oliva… species   
##  3 BART   BART_025 C1      2015-06-14 09:23:00  BAWW     Mniotilta v… species   
##  4 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  5 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  6 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  7 BART   BART_025 B1      2015-06-14 09:43:00  BAWW     Mniotilta v… species   
##  8 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  9 BART   BART_025 B1      2015-06-14 09:43:00  BTNW     Setophaga v… species   
## 10 BART   BART_025 A1      2015-06-14 10:31:00  REVI     Vireo oliva… species   
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## #   latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>

Or with the drop_na() function from tidyr package.

tidyr::drop_na(data_bird, latitude)
## # A tibble: 226,826 × 14
##    siteID plotID   pointID observation_datetime taxon_id taxon_name   taxon_rank
##    <chr>  <chr>    <chr>   <dttm>               <chr>    <chr>        <chr>     
##  1 BART   BART_025 C1      2015-06-14 09:23:00  BCCH     Poecile atr… species   
##  2 BART   BART_025 C1      2015-06-14 09:23:00  REVI     Vireo oliva… species   
##  3 BART   BART_025 C1      2015-06-14 09:23:00  BAWW     Mniotilta v… species   
##  4 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  5 BART   BART_025 C1      2015-06-14 09:23:00  BTNW     Setophaga v… species   
##  6 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  7 BART   BART_025 B1      2015-06-14 09:43:00  BAWW     Mniotilta v… species   
##  8 BART   BART_025 B1      2015-06-14 09:43:00  WIWR     Troglodytes… species   
##  9 BART   BART_025 B1      2015-06-14 09:43:00  BTNW     Setophaga v… species   
## 10 BART   BART_025 A1      2015-06-14 10:31:00  REVI     Vireo oliva… species   
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## #   latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>

How to combine multiple operations? |>

So, each of the above verb can be useful. They do one thing and do that well. We can combine them with the pipe |> together to do complex things at once. This would remove the need to name intermediate results.

For example, if we want to 1) remove NA in latitude, then 2) select only key variables, and 3) focus on some specific sites, and 4) calculate species richness of each site.

Without pipe, we need to do this:

d = tidyr::drop_na(data_bird, latitude)
d1 = select(d, siteID, taxon_id, value)
d2 = filter(d1, siteID %in% c("BART", "HARV", "BLAN"))
d3 = group_by(d2, siteID)
summarise(d3, sp_richness = n_distinct(taxon_id))
## # A tibble: 3 × 2
##   siteID sp_richness
##   <chr>        <int>
## 1 BART            80
## 2 BLAN            82
## 3 HARV           108

With pipe, we don’t need to save intermediate variables:

data_bird |> 
  tidyr::drop_na(latitude) |> # d
  select(siteID, taxon_id, value) |> # d1
  filter(siteID %in% c("BART", "HARV", "BLAN")) |> # d2
  group_by(siteID) |> # d3
  summarise(sp_richness = n_distinct(taxon_id))
## # A tibble: 3 × 2
##   siteID sp_richness
##   <chr>        <int>
## 1 BART            80
## 2 BLAN            82
## 3 HARV           108

How to derive a site by species matrix (wide data frame) from a long data frame? pivot_wider()

The current data_bird data frame is a long data frame: each row is an observation, each column is a variable, each cell is a single value. To learn more about tidy data, see here.

Some times we need to reshape the long data frame into a wide format: a site by species matrix, with cells to be presence/absence or abundance values. How to do this?

library(tidyr)
# let's select relevant columns first to make it simpler
d = select(data_bird, siteID, plotID, taxon_id, value) |> 
  distinct() # remove duplicated rows
d
## # A tibble: 28,127 × 4
##    siteID plotID   taxon_id value
##    <chr>  <chr>    <chr>    <dbl>
##  1 BART   BART_025 BCCH         1
##  2 BART   BART_025 REVI         1
##  3 BART   BART_025 BAWW         1
##  4 BART   BART_025 BTNW         1
##  5 BART   BART_025 WIWR         1
##  6 BART   BART_025 AMRE         1
##  7 BART   BART_025 SWTH         1
##  8 BART   BART_025 WBNU         1
##  9 BART   BART_025 LEFL         1
## 10 BART   BART_025 BTBW         1
## # ℹ 28,117 more rows
# aggregate first
d_long = d |> 
  group_by(siteID, plotID, taxon_id) |> 
  summarise(abundance = sum(value, na.rm = TRUE), .groups = "drop")
d_long
## # A tibble: 23,188 × 4
##    siteID plotID   taxon_id abundance
##    <chr>  <chr>    <chr>        <dbl>
##  1 ABBY   ABBY_001 AMCR             2
##  2 ABBY   ABBY_001 AMRO             1
##  3 ABBY   ABBY_001 BHGR             1
##  4 ABBY   ABBY_001 BRCR             1
##  5 ABBY   ABBY_001 BTPI             1
##  6 ABBY   ABBY_001 BTYW             1
##  7 ABBY   ABBY_001 GCKI             1
##  8 ABBY   ABBY_001 HEWA             1
##  9 ABBY   ABBY_001 HUVI             1
## 10 ABBY   ABBY_001 NOFL             1
## # ℹ 23,178 more rows
d_wide = pivot_wider(d_long, 
                     id_cols = c(siteID, plotID),
                     names_from = taxon_id, 
                     values_from = abundance,
                     values_fill = 0)
d_wide
## # A tibble: 644 × 578
##    siteID plotID    AMCR  AMRO  BHGR  BRCR  BTPI  BTYW  GCKI  HEWA  HUVI  NOFL
##    <chr>  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ABBY   ABBY_001     2     1     1     1     1     1     1     1     1     1
##  2 ABBY   ABBY_002     0     0     1     0     3     0     0     1     0     1
##  3 ABBY   ABBY_003     0     1     1     0     1     1     0     1     0     1
##  4 ABBY   ABBY_004     0     1     1     0     1     0     0     1     0     1
##  5 ABBY   ABBY_005     1     1     1     0     1     0     0     0     0     1
##  6 ABBY   ABBY_006     0     0     1     1     0     0     0     1     0     0
##  7 ABBY   ABBY_007     0     1     1     1     0     0     0     0     0     1
##  8 ABBY   ABBY_008     0     1     1     1     1     1     3     1     0     1
##  9 ABBY   ABBY_009     0     1     1     0     1     0     0     1     0     0
## 10 ABBY   ABBY_010     0     1     1     1     0     0     0     1     0     0
## # ℹ 634 more rows
## # ℹ 566 more variables: NOPO <dbl>, PAWR <dbl>, PIWO <dbl>, PSFL <dbl>,
## #   RBNU <dbl>, SPTO <dbl>, STJA <dbl>, SWTH <dbl>, WAVI <dbl>, WETA <dbl>,
## #   WIWA <dbl>, AMGO <dbl>, BEWR <dbl>, CORA <dbl>, DEJU <dbl>, DUFL <dbl>,
## #   HAWO <dbl>, MGWA <dbl>, ORJU <dbl>, OSFL <dbl>, PISI <dbl>, VATH <dbl>,
## #   WCSP <dbl>, WIFL <dbl>, BCCH <dbl>, CAJA <dbl>, CBCH <dbl>, MODO <dbl>,
## #   OCWA <dbl>, RBSA <dbl>, SOSP <dbl>, UNBI <dbl>, AUWA <dbl>, EVGR <dbl>, …
# we can convert it back to long format
d_wide |> 
  pivot_longer(cols = -c(siteID, plotID),
               names_to = "taxon_id",
               values_to = "abundance") |> 
  filter(abundance > 0)
## # A tibble: 23,188 × 4
##    siteID plotID   taxon_id abundance
##    <chr>  <chr>    <chr>        <dbl>
##  1 ABBY   ABBY_001 AMCR             2
##  2 ABBY   ABBY_001 AMRO             1
##  3 ABBY   ABBY_001 BHGR             1
##  4 ABBY   ABBY_001 BRCR             1
##  5 ABBY   ABBY_001 BTPI             1
##  6 ABBY   ABBY_001 BTYW             1
##  7 ABBY   ABBY_001 GCKI             1
##  8 ABBY   ABBY_001 HEWA             1
##  9 ABBY   ABBY_001 HUVI             1
## 10 ABBY   ABBY_001 NOFL             1
## # ℹ 23,178 more rows

How to add information from another data frame to a data frame? left_join()

Here is another data frame that contains site information.

neonDivData::neon_sites
## # A tibble: 81 × 10
##    `Site Name`            siteID `Domain Name` domainID State Latitude Longitude
##    <chr>                  <chr>  <chr>         <chr>    <chr>    <dbl>     <dbl>
##  1 Little Rock Lake       LIRO   Great Lakes   D05      WI        46.0     -89.7
##  2 West St Louis Creek    WLOU   Southern Roc… D13      CO        39.9    -106. 
##  3 Pu'u Maka'ala Natural… PUUM   Pacific Trop… D20      HI        19.6    -155. 
##  4 Flint River            FLNT   Southeast     D03      GA        31.2     -84.4
##  5 McDiffett Creek        MCDI   Prairie Peni… D06      KS        38.9     -96.4
##  6 Lewis Run              LEWI   Mid-Atlantic  D02      VA        39.1     -78.0
##  7 Blue River             BLUE   Southern Pla… D11      OK        34.4     -96.6
##  8 Teakettle 2 Creek      TECR   Pacific Sout… D17      CA        37.0    -119. 
##  9 Lower Hop Brook        HOPB   Northeast     D01      MA        42.5     -72.3
## 10 Martha Creek           MART   Pacific Nort… D16      WA        45.8    -122. 
## # ℹ 71 more rows
## # ℹ 3 more variables: `Site Type` <chr>, `Site Subtype` <chr>,
## #   `Site Host` <chr>

How can we add the site information to d_wide?

unique(d_wide$siteID)
##  [1] "ABBY" "BARR" "BART" "BLAN" "BONA" "CLBJ" "CPER" "DCFS" "DEJU" "DELA"
## [11] "DSNY" "GRSM" "GUAN" "HARV" "HEAL" "JERC" "JORN" "KONA" "KONZ" "LAJA"
## [21] "LENO" "MLBS" "MOAB" "NIWO" "NOGP" "OAES" "ONAQ" "ORNL" "OSBS" "PUUM"
## [31] "RMNP" "SCBI" "SERC" "SJER" "SOAP" "SRER" "STEI" "STER" "TALL" "TEAK"
## [41] "TOOL" "TREE" "UKFS" "UNDE" "WOOD" "WREF" "YELL"
d_wide_site = left_join(d_wide[, 1:5], neonDivData::neon_sites, by = "siteID")
d_wide_site
## # A tibble: 644 × 14
##    siteID plotID    AMCR  AMRO  BHGR `Site Name` `Domain Name`    domainID State
##    <chr>  <chr>    <dbl> <dbl> <dbl> <chr>       <chr>            <chr>    <chr>
##  1 ABBY   ABBY_001     2     1     1 Abby Road   Pacific Northwe… D16      WA   
##  2 ABBY   ABBY_002     0     0     1 Abby Road   Pacific Northwe… D16      WA   
##  3 ABBY   ABBY_003     0     1     1 Abby Road   Pacific Northwe… D16      WA   
##  4 ABBY   ABBY_004     0     1     1 Abby Road   Pacific Northwe… D16      WA   
##  5 ABBY   ABBY_005     1     1     1 Abby Road   Pacific Northwe… D16      WA   
##  6 ABBY   ABBY_006     0     0     1 Abby Road   Pacific Northwe… D16      WA   
##  7 ABBY   ABBY_007     0     1     1 Abby Road   Pacific Northwe… D16      WA   
##  8 ABBY   ABBY_008     0     1     1 Abby Road   Pacific Northwe… D16      WA   
##  9 ABBY   ABBY_009     0     1     1 Abby Road   Pacific Northwe… D16      WA   
## 10 ABBY   ABBY_010     0     1     1 Abby Road   Pacific Northwe… D16      WA   
## # ℹ 634 more rows
## # ℹ 5 more variables: Latitude <dbl>, Longitude <dbl>, `Site Type` <chr>,
## #   `Site Subtype` <chr>, `Site Host` <chr>

To learn more, check out the RStudio cheatsheets