select()
filter()
mutate()
summarise()
arrange()
NA
of latitude?
drop_na()
|>
pivot_wider()
left_join()
Today’s topic is about data manipulation. Before we jump into the R functions, let’s talk about what are data and data chain/wrokflow first.
Data are a set of values of qualitative or quantitative variables collected through observations.
Raw data have not been “cleaned” to remove outliers, instrument/observation errors, or data entry errors. Raw data can be relative: data may be raw to you, but they may have been pre-processed by someone prior to you receiving them.
Every one on the data chain should:
Spreadsheet is probably the most common way to enter and organize data for most cases (when data size is relatively small). If this applies to your own work, be sure to read this excellent data organization with spreadsheet slides.
Summary here:
Ok, with the above information, let’s talk about data cleaning, which generally takes the most of project time (like 80%). This is probably because that data cleaning is hard to be generalized to every project. But there are some common principles. And below is an excellent summary of data cleaning principles provided by Dr. Karl Broman.
Please do make sure to read his notes along with the slides.
Today, we will mainly introduce some functions from the
tidyverse
that I regularly use for daily work. Again, I
will just focus on a very small proportion of them. To learn more, read
the corresponding R packages’ documentations!
First, let’s get some datasets.
This data package has organismal data collected by NEON. You can get all the raw
data from its data
portal and R package neonUtilities
.
library(neonDivData)
data_bird = neonDivData::data_bird[, c("siteID", "plotID", "pointID", "observation_datetime",
"taxon_id", "taxon_name", "taxon_rank", "value", "unit", "variable_name",
"latitude", "longitude", "elevation")]
DT::datatable(head(data_bird, n = 100))
Wow, there are lots of variables! And believe it or not, this is actually already a “cleaned” version of the data. Normally, we’d check the data structure and look at what values does each variable contain. For example:
## tibble [226,826 × 35] (S3: tbl_df/tbl/data.frame)
## $ location_id : chr [1:226826] "BART_025.birdGrid.brd" "BART_025.birdGrid.brd" "BART_025.birdGrid.brd" "BART_025.birdGrid.brd" ...
## $ siteID : chr [1:226826] "BART" "BART" "BART" "BART" ...
## $ plotID : chr [1:226826] "BART_025" "BART_025" "BART_025" "BART_025" ...
## $ pointID : chr [1:226826] "C1" "C1" "C1" "C1" ...
## $ unique_sample_id : chr [1:226826] "BART_025.C1.2015-06-14" "BART_025.C1.2015-06-14" "BART_025.C1.2015-06-14" "BART_025.C1.2015-06-14" ...
## $ observation_datetime : POSIXct[1:226826], format: "2015-06-14 09:23:00" "2015-06-14 09:23:00" ...
## $ taxon_id : chr [1:226826] "BCCH" "REVI" "BAWW" "BTNW" ...
## $ taxon_name : chr [1:226826] "Poecile atricapillus" "Vireo olivaceus" "Mniotilta varia" "Setophaga virens" ...
## $ taxon_rank : chr [1:226826] "species" "species" "species" "species" ...
## $ variable_name : chr [1:226826] "cluster size" "cluster size" "cluster size" "cluster size" ...
## $ value : num [1:226826] 1 1 1 1 1 1 1 1 1 1 ...
## $ unit : chr [1:226826] "count of individuals" "count of individuals" "count of individuals" "count of individuals" ...
## $ pointCountMinute : chr [1:226826] "1" "1" "2" "2" ...
## $ targetTaxaPresent : chr [1:226826] "Y" "Y" "Y" "Y" ...
## $ nativeStatusCode : chr [1:226826] "N" "N" "N" "N" ...
## $ observerDistance : chr [1:226826] "42" "9" "17" "50" ...
## $ detectionMethod : chr [1:226826] "singing" "singing" "singing" "singing" ...
## $ visualConfirmation : chr [1:226826] "No" "No" "No" "No" ...
## $ sexOrAge : chr [1:226826] "Male" "Male" "Male" "Male" ...
## $ release : chr [1:226826] "RELEASE-2022" "RELEASE-2022" "RELEASE-2022" "RELEASE-2022" ...
## $ startCloudCoverPercentage : chr [1:226826] "20" "20" "20" "20" ...
## $ endCloudCoverPercentage : chr [1:226826] "40" "40" "40" "40" ...
## $ startRH : chr [1:226826] "72" "72" "72" "72" ...
## $ endRH : chr [1:226826] "56" "56" "56" "56" ...
## $ observedHabitat : chr [1:226826] "evergreen forest" "evergreen forest" "evergreen forest" "evergreen forest" ...
## $ observedAirTemp : chr [1:226826] "18" "18" "18" "18" ...
## $ kmPerHourObservedWindSpeed: chr [1:226826] "1" "1" "1" "1" ...
## $ samplingProtocolVersion : chr [1:226826] "NEON.DOC.014041vG" "NEON.DOC.014041vG" "NEON.DOC.014041vG" "NEON.DOC.014041vG" ...
## $ remarks : chr [1:226826] NA NA NA NA ...
## $ clusterCode : chr [1:226826] NA NA NA NA ...
## $ latitude : num [1:226826] 44.1 44.1 44.1 44.1 44.1 ...
## $ longitude : num [1:226826] -71.3 -71.3 -71.3 -71.3 -71.3 ...
## $ elevation : num [1:226826] 576 576 576 576 576 ...
## $ nlcdClass : chr [1:226826] "evergreenForest" "evergreenForest" "evergreenForest" "evergreenForest" ...
## $ plotType : chr [1:226826] "distributed" "distributed" "distributed" "distributed" ...
##
## class family genus species speciesGroup subfamily
## 1791 1835 261 222552 139 114
## subspecies
## 134
We will skip these steps and leave it to the “Exploratory data analysis” lecture.
Instead, I will introduce some functions by tasks.
All verbs in tidyverse
work
similarly:
select()
This dataset is a data frame. So, we can always use the data frame subsetting codes we learned to select specific columns.
data_bird[, c("siteID", "plotID", "pointID", "observation_datetime",
"taxon_id", "taxon_name", "taxon_rank",
"latitude", "longitude", "elevation")]
## # A tibble: 226,826 × 10
## siteID plotID pointID observation_datetime taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atr… species
## 2 BART BART_025 C1 2015-06-14 09:23:00 REVI Vireo oliva… species
## 3 BART BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta v… species
## 4 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 5 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 6 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 7 BART BART_025 B1 2015-06-14 09:43:00 BAWW Mniotilta v… species
## 8 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 9 BART BART_025 B1 2015-06-14 09:43:00 BTNW Setophaga v… species
## 10 BART BART_025 A1 2015-06-14 10:31:00 REVI Vireo oliva… species
## # ℹ 226,816 more rows
## # ℹ 3 more variables: latitude <dbl>, longitude <dbl>, elevation <dbl>
More conveniently, we can use the select()
functions
from dplyr
.
library(dplyr, warn.conflicts = FALSE)
select(data_bird, siteID, plotID, pointID, date_time = observation_datetime,
starts_with("taxon"),
latitude, longitude, elevation)
## # A tibble: 226,826 × 10
## siteID plotID pointID date_time taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atri… species
## 2 BART BART_025 C1 2015-06-14 09:23:00 REVI Vireo olivac… species
## 3 BART BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta va… species
## 4 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga vi… species
## 5 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga vi… species
## 6 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes … species
## 7 BART BART_025 B1 2015-06-14 09:43:00 BAWW Mniotilta va… species
## 8 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes … species
## 9 BART BART_025 B1 2015-06-14 09:43:00 BTNW Setophaga vi… species
## 10 BART BART_025 A1 2015-06-14 10:31:00 REVI Vireo olivac… species
## # ℹ 226,816 more rows
## # ℹ 3 more variables: latitude <dbl>, longitude <dbl>, elevation <dbl>
select()
can be combined with other functions such as
ends_with()
, everything()
,
matches()
, etc.
## # A tibble: 226,826 × 2
## taxon_name variable_name
## <chr> <chr>
## 1 Poecile atricapillus cluster size
## 2 Vireo olivaceus cluster size
## 3 Mniotilta varia cluster size
## 4 Setophaga virens cluster size
## 5 Setophaga virens cluster size
## 6 Troglodytes hiemalis cluster size
## 7 Mniotilta varia cluster size
## 8 Troglodytes hiemalis cluster size
## 9 Setophaga virens cluster size
## 10 Vireo olivaceus cluster size
## # ℹ 226,816 more rows
## # A tibble: 226,826 × 4
## value latitude longitude elevation
## <dbl> <dbl> <dbl> <dbl>
## 1 1 44.1 -71.3 576.
## 2 1 44.1 -71.3 576.
## 3 1 44.1 -71.3 576.
## 4 1 44.1 -71.3 576.
## 5 1 44.1 -71.3 576.
## 6 1 44.1 -71.3 576.
## 7 1 44.1 -71.3 576.
## 8 1 44.1 -71.3 576.
## 9 1 44.1 -71.3 576.
## 10 1 44.1 -71.3 576.
## # ℹ 226,816 more rows
## # A tibble: 226,826 × 13
## latitude longitude siteID plotID pointID observation_datetime taxon_id
## <dbl> <dbl> <chr> <chr> <chr> <dttm> <chr>
## 1 44.1 -71.3 BART BART_025 C1 2015-06-14 09:23:00 BCCH
## 2 44.1 -71.3 BART BART_025 C1 2015-06-14 09:23:00 REVI
## 3 44.1 -71.3 BART BART_025 C1 2015-06-14 09:23:00 BAWW
## 4 44.1 -71.3 BART BART_025 C1 2015-06-14 09:23:00 BTNW
## 5 44.1 -71.3 BART BART_025 C1 2015-06-14 09:23:00 BTNW
## 6 44.1 -71.3 BART BART_025 B1 2015-06-14 09:43:00 WIWR
## 7 44.1 -71.3 BART BART_025 B1 2015-06-14 09:43:00 BAWW
## 8 44.1 -71.3 BART BART_025 B1 2015-06-14 09:43:00 WIWR
## 9 44.1 -71.3 BART BART_025 B1 2015-06-14 09:43:00 BTNW
## 10 44.1 -71.3 BART BART_025 A1 2015-06-14 10:31:00 REVI
## # ℹ 226,816 more rows
## # ℹ 6 more variables: taxon_name <chr>, taxon_rank <chr>, value <dbl>,
## # unit <chr>, variable_name <chr>, elevation <dbl>
## # A tibble: 226,826 × 3
## siteID pointID taxon_id
## <chr> <chr> <chr>
## 1 BART C1 BCCH
## 2 BART C1 REVI
## 3 BART C1 BAWW
## 4 BART C1 BTNW
## 5 BART C1 BTNW
## 6 BART B1 WIWR
## 7 BART B1 BAWW
## 8 BART B1 WIWR
## 9 BART B1 BTNW
## 10 BART A1 REVI
## # ℹ 226,816 more rows
Note that sometimes the select function will be
covered by the {raster::select}
function. If this is the
case, try to be explicit and use dplyr::select()
.
filter()
Again, base R functions to subset data frames work here. But we will
focus on the dplyr::filter()
function.
Note that sometimes the filter function will be covered
by the {stats::filter}
function. If you are not sure, just
call the function filter
[without ()
] in R and
it will let you know which package it is from.
## # A tibble: 371 × 13
## siteID plotID pointID observation_datetime taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atr… species
## 2 BART BART_025 C1 2015-06-14 09:23:00 REVI Vireo oliva… species
## 3 BART BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta v… species
## 4 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 5 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 6 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 7 BART BART_025 B1 2015-06-14 09:43:00 BAWW Mniotilta v… species
## 8 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 9 BART BART_025 B1 2015-06-14 09:43:00 BTNW Setophaga v… species
## 10 BART BART_025 A1 2015-06-14 10:31:00 REVI Vireo oliva… species
## # ℹ 361 more rows
## # ℹ 6 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## # latitude <dbl>, longitude <dbl>, elevation <dbl>
## # A tibble: 63 × 13
## siteID plotID pointID observation_datetime taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atr… species
## 2 BART BART_025 C1 2015-06-14 09:23:00 REVI Vireo oliva… species
## 3 BART BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta v… species
## 4 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 5 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 6 BART BART_025 C1 2016-06-24 09:39:00 BTNW Setophaga v… species
## 7 BART BART_025 C1 2016-06-24 09:39:00 RBNU Sitta canad… species
## 8 BART BART_025 C1 2016-06-24 09:39:00 BHVI Vireo solit… species
## 9 BART BART_025 C1 2016-06-24 09:39:00 BTBW Setophaga c… species
## 10 BART BART_025 C1 2016-06-24 09:39:00 REVI Vireo oliva… species
## # ℹ 53 more rows
## # ℹ 6 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## # latitude <dbl>, longitude <dbl>, elevation <dbl>
## # A tibble: 21,163 × 13
## siteID plotID pointID observation_datetime taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atr… species
## 2 BART BART_025 C1 2015-06-14 09:23:00 REVI Vireo oliva… species
## 3 BART BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta v… species
## 4 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 5 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 6 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 7 BART BART_025 B1 2015-06-14 09:43:00 BAWW Mniotilta v… species
## 8 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 9 BART BART_025 B1 2015-06-14 09:43:00 BTNW Setophaga v… species
## 10 BART BART_025 A1 2015-06-14 10:31:00 REVI Vireo oliva… species
## # ℹ 21,153 more rows
## # ℹ 6 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## # latitude <dbl>, longitude <dbl>, elevation <dbl>
## # A tibble: 86,241 × 13
## siteID plotID pointID observation_datetime taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atr… species
## 2 BART BART_025 C1 2015-06-14 09:23:00 REVI Vireo oliva… species
## 3 BART BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta v… species
## 4 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 5 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 6 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 7 BART BART_025 B1 2015-06-14 09:43:00 BAWW Mniotilta v… species
## 8 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 9 BART BART_025 B1 2015-06-14 09:43:00 BTNW Setophaga v… species
## 10 BART BART_025 A1 2015-06-14 10:31:00 REVI Vireo oliva… species
## # ℹ 86,231 more rows
## # ℹ 6 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## # latitude <dbl>, longitude <dbl>, elevation <dbl>
mutate()
With base R functions, we can do this with assignment.
Here is the dplyr
way:
## # A tibble: 226,826 × 14
## siteID plotID pointID observation_datetime taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atr… species
## 2 BART BART_025 C1 2015-06-14 09:23:00 REVI Vireo oliva… species
## 3 BART BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta v… species
## 4 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 5 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 6 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 7 BART BART_025 B1 2015-06-14 09:43:00 BAWW Mniotilta v… species
## 8 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 9 BART BART_025 B1 2015-06-14 09:43:00 BTNW Setophaga v… species
## 10 BART BART_025 A1 2015-06-14 10:31:00 REVI Vireo oliva… species
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## # latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>
## # A tibble: 226,826 × 14
## siteID plotID pointID observation_datetime taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atr… species
## 2 BART BART_025 C1 2015-06-14 09:23:00 REVI Vireo oliva… species
## 3 BART BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta v… species
## 4 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 5 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 6 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 7 BART BART_025 B1 2015-06-14 09:43:00 BAWW Mniotilta v… species
## 8 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 9 BART BART_025 B1 2015-06-14 09:43:00 BTNW Setophaga v… species
## 10 BART BART_025 A1 2015-06-14 10:31:00 REVI Vireo oliva… species
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## # latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>
# the created column can be used immediately
mutate(data_bird,
new_column = value * 10,
newnew_col = new_column * 10
)
## # A tibble: 226,826 × 15
## siteID plotID pointID observation_datetime taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atr… species
## 2 BART BART_025 C1 2015-06-14 09:23:00 REVI Vireo oliva… species
## 3 BART BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta v… species
## 4 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 5 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 6 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 7 BART BART_025 B1 2015-06-14 09:43:00 BAWW Mniotilta v… species
## 8 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 9 BART BART_025 B1 2015-06-14 09:43:00 BTNW Setophaga v… species
## 10 BART BART_025 A1 2015-06-14 10:31:00 REVI Vireo oliva… species
## # ℹ 226,816 more rows
## # ℹ 8 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## # latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>,
## # newnew_col <dbl>
##
## high low
## 46437 180389
# recode is very useful to correct values
mutate(data_bird, siteID = recode(siteID, "BART" = "BART2"))
## # A tibble: 226,826 × 14
## siteID plotID pointID observation_datetime taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART2 BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atr… species
## 2 BART2 BART_025 C1 2015-06-14 09:23:00 REVI Vireo oliva… species
## 3 BART2 BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta v… species
## 4 BART2 BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 5 BART2 BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 6 BART2 BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 7 BART2 BART_025 B1 2015-06-14 09:43:00 BAWW Mniotilta v… species
## 8 BART2 BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 9 BART2 BART_025 B1 2015-06-14 09:43:00 BTNW Setophaga v… species
## 10 BART2 BART_025 A1 2015-06-14 10:31:00 REVI Vireo oliva… species
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## # latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>
summarise()
Suppose we want to calculate the number of species observed (species
richness) at each site, how should we get such values? We can combine
summarise()
with group_by()
.
## # A tibble: 226,826 × 14
## # Groups: siteID [47]
## siteID plotID pointID observation_datetime taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atr… species
## 2 BART BART_025 C1 2015-06-14 09:23:00 REVI Vireo oliva… species
## 3 BART BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta v… species
## 4 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 5 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 6 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 7 BART BART_025 B1 2015-06-14 09:43:00 BAWW Mniotilta v… species
## 8 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 9 BART BART_025 B1 2015-06-14 09:43:00 BTNW Setophaga v… species
## 10 BART BART_025 A1 2015-06-14 10:31:00 REVI Vireo oliva… species
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## # latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>
sp_rich_by_site <- data_bird |>
group_by(siteID) |>
summarise(sp_richness = n_distinct(taxon_id))
# same as length(unique(taxon_id))
sp_rich_by_site
## # A tibble: 47 × 2
## siteID sp_richness
## <chr> <int>
## 1 ABBY 81
## 2 BARR 43
## 3 BART 80
## 4 BLAN 82
## 5 BONA 47
## 6 CLBJ 126
## 7 CPER 72
## 8 DCFS 120
## 9 DEJU 52
## 10 DELA 76
## # ℹ 37 more rows
# data can be grouped by multiple variables
sp_rich_by_site_point <- data_bird |>
group_by(siteID, pointID) |>
summarise(sp_richness = n_distinct(taxon_id),
.groups = "drop")
sp_rich_by_site_point
## # A tibble: 303 × 3
## siteID pointID sp_richness
## <chr> <chr> <int>
## 1 ABBY 21 81
## 2 BARR A1 27
## 3 BARR A2 25
## 4 BARR A3 26
## 5 BARR B1 25
## 6 BARR B2 29
## 7 BARR B3 23
## 8 BARR C1 23
## 9 BARR C2 26
## 10 BARR C3 32
## # ℹ 293 more rows
# summarise() can have multiple calculations
data_bird |>
group_by(siteID, pointID) |>
summarise(sp_richness = n_distinct(taxon_id),
abundance = sum(value),
n_obs = n(),
.groups = "drop")
## # A tibble: 303 × 5
## siteID pointID sp_richness abundance n_obs
## <chr> <chr> <int> <dbl> <int>
## 1 ABBY 21 81 1915 1839
## 2 BARR A1 27 409 246
## 3 BARR A2 25 357 223
## 4 BARR A3 26 394 278
## 5 BARR B1 25 497 253
## 6 BARR B2 29 444 277
## 7 BARR B3 23 324 252
## 8 BARR C1 23 329 258
## 9 BARR C2 26 336 265
## 10 BARR C3 32 448 276
## # ℹ 293 more rows
Install the following data package.
## Loading required package: palmerpenguins
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
arrange()
For the sp_rich_by_site_point
data frame, how do we
order the rows by sp_richness
(high to low)?
## # A tibble: 303 × 3
## siteID pointID sp_richness
## <chr> <chr> <int>
## 1 DCFS 21 120
## 2 NOGP 21 106
## 3 SERC 21 100
## 4 WOOD C1 93
## 5 WOOD B1 92
## 6 WOOD A3 90
## 7 SCBI 21 89
## 8 WOOD A1 88
## 9 WOOD A2 87
## 10 KONA 21 86
## # ℹ 293 more rows
NA
of latitude?
drop_na()
Observations without latitude probably won’t be that useful, how should we remove them? We can do so with Base R:
## [1] 0
## # A tibble: 226,826 × 14
## siteID plotID pointID observation_datetime taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atr… species
## 2 BART BART_025 C1 2015-06-14 09:23:00 REVI Vireo oliva… species
## 3 BART BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta v… species
## 4 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 5 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 6 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 7 BART BART_025 B1 2015-06-14 09:43:00 BAWW Mniotilta v… species
## 8 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 9 BART BART_025 B1 2015-06-14 09:43:00 BTNW Setophaga v… species
## 10 BART BART_025 A1 2015-06-14 10:31:00 REVI Vireo oliva… species
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## # latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>
Or with filter()
:
## # A tibble: 226,826 × 14
## siteID plotID pointID observation_datetime taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atr… species
## 2 BART BART_025 C1 2015-06-14 09:23:00 REVI Vireo oliva… species
## 3 BART BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta v… species
## 4 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 5 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 6 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 7 BART BART_025 B1 2015-06-14 09:43:00 BAWW Mniotilta v… species
## 8 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 9 BART BART_025 B1 2015-06-14 09:43:00 BTNW Setophaga v… species
## 10 BART BART_025 A1 2015-06-14 10:31:00 REVI Vireo oliva… species
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## # latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>
Or with the drop_na()
function from tidyr
package.
## # A tibble: 226,826 × 14
## siteID plotID pointID observation_datetime taxon_id taxon_name taxon_rank
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 BART BART_025 C1 2015-06-14 09:23:00 BCCH Poecile atr… species
## 2 BART BART_025 C1 2015-06-14 09:23:00 REVI Vireo oliva… species
## 3 BART BART_025 C1 2015-06-14 09:23:00 BAWW Mniotilta v… species
## 4 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 5 BART BART_025 C1 2015-06-14 09:23:00 BTNW Setophaga v… species
## 6 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 7 BART BART_025 B1 2015-06-14 09:43:00 BAWW Mniotilta v… species
## 8 BART BART_025 B1 2015-06-14 09:43:00 WIWR Troglodytes… species
## 9 BART BART_025 B1 2015-06-14 09:43:00 BTNW Setophaga v… species
## 10 BART BART_025 A1 2015-06-14 10:31:00 REVI Vireo oliva… species
## # ℹ 226,816 more rows
## # ℹ 7 more variables: value <dbl>, unit <chr>, variable_name <chr>,
## # latitude <dbl>, longitude <dbl>, elevation <dbl>, new_column <dbl>
|>
So, each of the above verb can be useful. They do one thing and do
that well. We can combine them with the pipe |>
together
to do complex things at once. This would remove the need to name
intermediate results.
For example, if we want to 1) remove NA
in latitude,
then 2) select only key variables, and 3) focus on some specific sites,
and 4) calculate species richness of each site.
Without pipe, we need to do this:
d = tidyr::drop_na(data_bird, latitude)
d1 = select(d, siteID, taxon_id, value)
d2 = filter(d1, siteID %in% c("BART", "HARV", "BLAN"))
d3 = group_by(d2, siteID)
summarise(d3, sp_richness = n_distinct(taxon_id))
## # A tibble: 3 × 2
## siteID sp_richness
## <chr> <int>
## 1 BART 80
## 2 BLAN 82
## 3 HARV 108
With pipe, we don’t need to save intermediate variables:
data_bird |>
tidyr::drop_na(latitude) |> # d
select(siteID, taxon_id, value) |> # d1
filter(siteID %in% c("BART", "HARV", "BLAN")) |> # d2
group_by(siteID) |> # d3
summarise(sp_richness = n_distinct(taxon_id))
## # A tibble: 3 × 2
## siteID sp_richness
## <chr> <int>
## 1 BART 80
## 2 BLAN 82
## 3 HARV 108
pivot_wider()
The current data_bird
data frame is a long data
frame: each row is an observation, each column is a variable, each cell
is a single value. To learn more about tidy data, see
here.
Some times we need to reshape the long data frame into a wide format: a site by species matrix, with cells to be presence/absence or abundance values. How to do this?
library(tidyr)
# let's select relevant columns first to make it simpler
d = select(data_bird, siteID, plotID, taxon_id, value) |>
distinct() # remove duplicated rows
d
## # A tibble: 28,127 × 4
## siteID plotID taxon_id value
## <chr> <chr> <chr> <dbl>
## 1 BART BART_025 BCCH 1
## 2 BART BART_025 REVI 1
## 3 BART BART_025 BAWW 1
## 4 BART BART_025 BTNW 1
## 5 BART BART_025 WIWR 1
## 6 BART BART_025 AMRE 1
## 7 BART BART_025 SWTH 1
## 8 BART BART_025 WBNU 1
## 9 BART BART_025 LEFL 1
## 10 BART BART_025 BTBW 1
## # ℹ 28,117 more rows
# aggregate first
d_long = d |>
group_by(siteID, plotID, taxon_id) |>
summarise(abundance = sum(value, na.rm = TRUE), .groups = "drop")
d_long
## # A tibble: 23,188 × 4
## siteID plotID taxon_id abundance
## <chr> <chr> <chr> <dbl>
## 1 ABBY ABBY_001 AMCR 2
## 2 ABBY ABBY_001 AMRO 1
## 3 ABBY ABBY_001 BHGR 1
## 4 ABBY ABBY_001 BRCR 1
## 5 ABBY ABBY_001 BTPI 1
## 6 ABBY ABBY_001 BTYW 1
## 7 ABBY ABBY_001 GCKI 1
## 8 ABBY ABBY_001 HEWA 1
## 9 ABBY ABBY_001 HUVI 1
## 10 ABBY ABBY_001 NOFL 1
## # ℹ 23,178 more rows
d_wide = pivot_wider(d_long,
id_cols = c(siteID, plotID),
names_from = taxon_id,
values_from = abundance,
values_fill = 0)
d_wide
## # A tibble: 644 × 578
## siteID plotID AMCR AMRO BHGR BRCR BTPI BTYW GCKI HEWA HUVI NOFL
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABBY ABBY_001 2 1 1 1 1 1 1 1 1 1
## 2 ABBY ABBY_002 0 0 1 0 3 0 0 1 0 1
## 3 ABBY ABBY_003 0 1 1 0 1 1 0 1 0 1
## 4 ABBY ABBY_004 0 1 1 0 1 0 0 1 0 1
## 5 ABBY ABBY_005 1 1 1 0 1 0 0 0 0 1
## 6 ABBY ABBY_006 0 0 1 1 0 0 0 1 0 0
## 7 ABBY ABBY_007 0 1 1 1 0 0 0 0 0 1
## 8 ABBY ABBY_008 0 1 1 1 1 1 3 1 0 1
## 9 ABBY ABBY_009 0 1 1 0 1 0 0 1 0 0
## 10 ABBY ABBY_010 0 1 1 1 0 0 0 1 0 0
## # ℹ 634 more rows
## # ℹ 566 more variables: NOPO <dbl>, PAWR <dbl>, PIWO <dbl>, PSFL <dbl>,
## # RBNU <dbl>, SPTO <dbl>, STJA <dbl>, SWTH <dbl>, WAVI <dbl>, WETA <dbl>,
## # WIWA <dbl>, AMGO <dbl>, BEWR <dbl>, CORA <dbl>, DEJU <dbl>, DUFL <dbl>,
## # HAWO <dbl>, MGWA <dbl>, ORJU <dbl>, OSFL <dbl>, PISI <dbl>, VATH <dbl>,
## # WCSP <dbl>, WIFL <dbl>, BCCH <dbl>, CAJA <dbl>, CBCH <dbl>, MODO <dbl>,
## # OCWA <dbl>, RBSA <dbl>, SOSP <dbl>, UNBI <dbl>, AUWA <dbl>, EVGR <dbl>, …
# we can convert it back to long format
d_wide |>
pivot_longer(cols = -c(siteID, plotID),
names_to = "taxon_id",
values_to = "abundance") |>
filter(abundance > 0)
## # A tibble: 23,188 × 4
## siteID plotID taxon_id abundance
## <chr> <chr> <chr> <dbl>
## 1 ABBY ABBY_001 AMCR 2
## 2 ABBY ABBY_001 AMRO 1
## 3 ABBY ABBY_001 BHGR 1
## 4 ABBY ABBY_001 BRCR 1
## 5 ABBY ABBY_001 BTPI 1
## 6 ABBY ABBY_001 BTYW 1
## 7 ABBY ABBY_001 GCKI 1
## 8 ABBY ABBY_001 HEWA 1
## 9 ABBY ABBY_001 HUVI 1
## 10 ABBY ABBY_001 NOFL 1
## # ℹ 23,178 more rows
left_join()
Here is another data frame that contains site information.
## # A tibble: 81 × 10
## `Site Name` siteID `Domain Name` domainID State Latitude Longitude
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Little Rock Lake LIRO Great Lakes D05 WI 46.0 -89.7
## 2 West St Louis Creek WLOU Southern Roc… D13 CO 39.9 -106.
## 3 Pu'u Maka'ala Natural… PUUM Pacific Trop… D20 HI 19.6 -155.
## 4 Flint River FLNT Southeast D03 GA 31.2 -84.4
## 5 McDiffett Creek MCDI Prairie Peni… D06 KS 38.9 -96.4
## 6 Lewis Run LEWI Mid-Atlantic D02 VA 39.1 -78.0
## 7 Blue River BLUE Southern Pla… D11 OK 34.4 -96.6
## 8 Teakettle 2 Creek TECR Pacific Sout… D17 CA 37.0 -119.
## 9 Lower Hop Brook HOPB Northeast D01 MA 42.5 -72.3
## 10 Martha Creek MART Pacific Nort… D16 WA 45.8 -122.
## # ℹ 71 more rows
## # ℹ 3 more variables: `Site Type` <chr>, `Site Subtype` <chr>,
## # `Site Host` <chr>
How can we add the site information to d_wide
?
## [1] "ABBY" "BARR" "BART" "BLAN" "BONA" "CLBJ" "CPER" "DCFS" "DEJU" "DELA"
## [11] "DSNY" "GRSM" "GUAN" "HARV" "HEAL" "JERC" "JORN" "KONA" "KONZ" "LAJA"
## [21] "LENO" "MLBS" "MOAB" "NIWO" "NOGP" "OAES" "ONAQ" "ORNL" "OSBS" "PUUM"
## [31] "RMNP" "SCBI" "SERC" "SJER" "SOAP" "SRER" "STEI" "STER" "TALL" "TEAK"
## [41] "TOOL" "TREE" "UKFS" "UNDE" "WOOD" "WREF" "YELL"
## # A tibble: 644 × 14
## siteID plotID AMCR AMRO BHGR `Site Name` `Domain Name` domainID State
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 ABBY ABBY_001 2 1 1 Abby Road Pacific Northwe… D16 WA
## 2 ABBY ABBY_002 0 0 1 Abby Road Pacific Northwe… D16 WA
## 3 ABBY ABBY_003 0 1 1 Abby Road Pacific Northwe… D16 WA
## 4 ABBY ABBY_004 0 1 1 Abby Road Pacific Northwe… D16 WA
## 5 ABBY ABBY_005 1 1 1 Abby Road Pacific Northwe… D16 WA
## 6 ABBY ABBY_006 0 0 1 Abby Road Pacific Northwe… D16 WA
## 7 ABBY ABBY_007 0 1 1 Abby Road Pacific Northwe… D16 WA
## 8 ABBY ABBY_008 0 1 1 Abby Road Pacific Northwe… D16 WA
## 9 ABBY ABBY_009 0 1 1 Abby Road Pacific Northwe… D16 WA
## 10 ABBY ABBY_010 0 1 1 Abby Road Pacific Northwe… D16 WA
## # ℹ 634 more rows
## # ℹ 5 more variables: Latitude <dbl>, Longitude <dbl>, `Site Type` <chr>,
## # `Site Subtype` <chr>, `Site Host` <chr>