In today’s lecture, we will continue to talk about R basics. Specifically, we will cover missing values, factors, date times, and subsetting.
NA
In R, missing values are represented with NA (short for “not available”). One main feature of NA is that it is infectious: most computations involving a missing value return another missing value.
## [1] NA
## [1] NA
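A minimal sketch of computations that propagate a missing value. The specific expressions are illustrative assumptions, not necessarily the ones used in the lecture:

```r
# Each of these involves a missing value, so each returns NA
NA > 5     # comparison
10 * NA    # arithmetic
!NA        # logical negation
```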
Exceptions exist when some identity holds for all possible inputs, so the result does not depend on the missing value:
## [1] 1
## [1] TRUE
## [1] FALSE
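The three results above are consistent with the following identities; we assume these are the kind of expressions intended:

```r
# The result is the same whatever value NA stands for
NA^0        # anything to the power 0 is 1
NA | TRUE   # anything OR TRUE is TRUE
NA & FALSE  # anything AND FALSE is FALSE
```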
Because of this propagation, we need to be careful when working with vectors that contain NA. For example:
## [1] NA
## [1] 2
## [1] NA NA NA NA
## [1] FALSE FALSE FALSE TRUE
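A sketch of the two usual pitfalls, using a hypothetical vector x (assumed here, chosen so that it reproduces output of the shape shown above): summaries return NA unless missing values are removed, and testing for NA with == never works, so we use is.na() instead.

```r
x <- c(1, 2, 3, NA)     # hypothetical example vector

mean(x)                 # NA: the missing value propagates
mean(x, na.rm = TRUE)   # 2: missing values are dropped first

x == NA                 # NA NA NA NA: comparing with NA is always NA
is.na(x)                # FALSE FALSE FALSE TRUE: the correct test
```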
There are four types of missing values: NA (logical), NA_integer_ (integer), NA_real_ (double), and NA_character_ (character). In most cases we don’t need to worry about this, because NA is automatically coerced to the correct type.
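A short sketch that checks the four types with typeof() and shows the automatic coercion when NA is combined with other values:

```r
typeof(NA)              # "logical"
typeof(NA_integer_)     # "integer"
typeof(NA_real_)        # "double"
typeof(NA_character_)   # "character"

# A plain NA takes on the type of the vector it is placed in
typeof(c(1.5, NA))      # "double"
typeof(c("a", NA))      # "character"
```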
An S3 object is a base type with at least a class attribute. A generic function can do different things to different S3 objects. An example is the str() function, which returns different output for different inputs (e.g., a vector vs. a data.frame).
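A minimal sketch of this dispatch, using two hypothetical objects v and df. The vector has no class attribute, while the data frame carries one, and str() prints them differently:

```r
v  <- 1:3
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

attr(v, "class")    # NULL: a plain integer vector has no class attribute
attr(df, "class")   # "data.frame"

str(v)    # compact one-line summary of the vector
str(df)   # one line per column, preceded by the dimensions
```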
A factor is a vector that can contain only predefined values. It is normally used to store categorical data (e.g., spring, summer, fall, winter). A factor is an integer vector with two attributes: a class (factor) that makes it behave differently from an integer vector, and a levels attribute that defines the allowed values.
## [1] spring summer fall winter
## Levels: fall spring summer winter
## [1] "integer"
## $levels
## [1] "fall" "spring" "summer" "winter"
##
## $class
## [1] "factor"
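A sketch that builds a factor like the one printed above and inspects its underlying type and attributes. The variable name v_f is an assumption:

```r
v_f <- factor(c("spring", "summer", "fall", "winter"))

v_f               # levels are sorted alphabetically by default
typeof(v_f)       # "integer"
attributes(v_f)   # a levels attribute and a class attribute
```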
## [1] spring fall
## Levels: spring summer fall winter
## v_f2
## spring summer fall winter
## 1 0 1 0
## [1] spring winter fall
## Levels: spring < summer < fall < winter
## Warning in `[<-.factor`(`*tmp*`, 4, value = "weather"): invalid factor level, NA
## generated
## [1] spring winter fall <NA>
## Levels: spring < summer < fall < winter
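A sketch of three properties visible in the output above: levels can be given explicitly, table() counts levels that never occur, and assigning a value outside the levels yields NA with a warning. The names seasons and v_f3 are assumptions; v_f2 appears in the table output above.

```r
seasons <- c("spring", "summer", "fall", "winter")

# Explicit levels keep the natural order and record unobserved values
v_f2 <- factor(c("spring", "fall"), levels = seasons)
table(v_f2)          # summer and winter are counted as 0

# An ordered factor also supports comparisons between levels
v_f3 <- factor(c("spring", "winter", "fall"), levels = seasons, ordered = TRUE)
v_f3[1] < v_f3[2]    # TRUE: spring comes before winter

# "weather" is not a level, so this produces NA and a warning
v_f3[4] <- "weather"
is.na(v_f3[4])       # TRUE
```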
Note: factors are built on integer vectors, even though they look like character vectors. So it is usually best to convert factors to character vectors if you need to work with strings.
## [1] 2 3 1 4
## [1] 1 4 3 NA
## [1] "spring" "summer" "fall" "winter"
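A sketch of why the conversion matters. The factor f_num is a hypothetical example added here to show a common pitfall with number-like labels:

```r
v_f <- factor(c("spring", "summer", "fall", "winter"))
as.integer(v_f)      # 2 3 1 4: positions in the sorted levels
as.character(v_f)    # "spring" "summer" "fall" "winter"

# Pitfall: converting number-like labels directly gives the integer codes
f_num <- factor(c("10", "20", "10"))
as.integer(f_num)                  # 1 2 1
as.numeric(as.character(f_num))    # 10 20 10
```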
Date vectors are built on top of double vectors, with a Date class.
## [1] "double"
## $class
## [1] "Date"
The value represents the number of days since 1970-01-01.
## [1] "1970-01-02"
## [1] 1
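A small sketch confirming both statements with a hypothetical date d, one day after the origin:

```r
d <- as.Date("1970-01-02")

typeof(d)       # "double"
attributes(d)   # only a class attribute, "Date"
unclass(d)      # 1: one day after 1970-01-01

# Going the other way: from a day count back to a date
as.Date(1, origin = "1970-01-01")
```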
Date-time information is stored in base R in two ways: POSIXct and POSIXlt. “POSIX” is short for Portable Operating System Interface, which is a family of cross-platform standards. “ct” stands for calendar time, and “lt” for local time.
POSIXct vectors are built on top of double vectors, where the value represents the number of seconds since 1970-01-01.
## [1] "2021-09-28 09:20:00 CDT"
## [1] "double"
## $class
## [1] "POSIXct" "POSIXt"
##
## $tzone
## [1] "US/Central"
## [1] 1632838800
## attr(,"tzone")
## [1] "US/Central"
## [1] "2021-09-28 22:20:00 CST"
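A sketch that reproduces values like those above. We assume the time zone name America/Chicago (the zone the output labels US/Central) and Asia/Shanghai for the second display; both names are assumptions and depend on the time zone database installed on your system.

```r
dt <- as.POSIXct("2021-09-28 09:20", tz = "America/Chicago")

typeof(dt)        # "double"
class(dt)         # "POSIXct" "POSIXt"
as.numeric(dt)    # 1632838800 seconds since 1970-01-01 00:00:00 UTC

# The tzone attribute controls only how the instant is displayed
dt_shanghai <- structure(dt, tzone = "Asia/Shanghai")
format(dt_shanghai, "%Y-%m-%d %H:%M")    # "2021-09-28 22:20"
as.numeric(dt_shanghai) == as.numeric(dt)   # TRUE: same instant
```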
The R package lubridate makes working with most date-time data easier. Make sure to install it and check it out.
So far, we have covered the most common data structures in R (which can be inspected with str()). We now move on to accessing specific elements of each of these data structures.
R has three subsetting operators, [, [[, and $; they interact differently with different vector types.
We can use [ to select any number of elements from a vector. There are six ways to do so.
## [1] 1 5
## [1] 1 1
## [1] 2 3
## [1] 2 4 5
## [1] 2 4 5
Logical vectors select the elements where the corresponding logical value is TRUE.
## [1] 1 3 5
## [1] 1 3 5
## [1] 1 3 NA 5
## [1] 1 NA 3 NA 5
## [1] 1 3 5
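A sketch of the six kinds of index (positive integers, negative integers, logical vectors, nothing, zero, and character names), using hypothetical vectors x and y. The calls are assumptions chosen to match the shape of the output above:

```r
x <- 1:5

x[c(1, 5)]        # positive integers select positions: 1 5
x[c(1, 1)]        # positions can be repeated: 1 1
x[c(2.1, 3.9)]    # real numbers are truncated to integers: 2 3
x[-c(1, 3)]       # negative integers exclude positions: 2 4 5

x[c(TRUE, FALSE, TRUE, FALSE, TRUE)]   # logical index: 1 3 5
x[x %% 2 == 1]                         # logical index from a condition: 1 3 5
x[c(TRUE, FALSE, TRUE, NA, TRUE)]      # NA in the index gives NA: 1 3 NA 5
x[c(TRUE, NA)]                         # short index is recycled: 1 NA 3 NA 5

x[]               # nothing returns the whole vector
x[0]              # zero returns a zero-length vector

y <- c(a = 1, b = 2, c = 3)
y[c("a", "c")]    # character vectors select by name
```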
Using [ to select elements of a list always returns a list. [[ and $ extract the elements themselves. [[ is used for extracting single items, while x$y is a useful shorthand for x[["y"]].
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [1] TRUE FALSE
## [[1]]
## [1] "a" "b" "c"
## [1] "a" "b" "c"
## [1] "c"
When a list is named, we can use $.
## $A
## [1] 1 2 3 4 5
##
## $B
## [1] "a" "b" "c"
##
## $C
## [1] TRUE FALSE
## [1] "a" "b" "c"
## [1] "a" "b" "c"
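A sketch contrasting the three operators on a hypothetical named list l, modelled on the output above:

```r
l <- list(A = 1:5, B = c("a", "b", "c"), C = c(TRUE, FALSE))

l[c(1, 3)]     # [ returns a list with two elements
l["B"]         # still a list, of length 1
l[["B"]]       # [[ returns the character vector itself
l$B            # shorthand for l[["B"]]
l[["B"]][3]    # "c": subset the extracted vector
```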
“If list x is a train carrying objects, then x[[5]] is the object in car 5; x[4:6] is a train of cars 4-6.” (@RLangTip, https://twitter.com/RLangTip/status/268375867468681216)
You can subset higher-dimensional structures in three ways: with multiple vectors, with a single vector, or with a matrix.
The most common way of subsetting matrices (2D) and arrays (>2D) is a simple generalisation of 1D subsetting: supply a 1D index for each dimension, separated by a comma.
Blank subsetting is now useful because it lets you keep all rows or all columns.
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("col_1", "col_2", "col_3")
rownames(a) <- c("row_1", "row_2", "row_3")
a
## col_1 col_2 col_3
## row_1 1 4 7
## row_2 2 5 8
## row_3 3 6 9
## col_1 col_2 col_3
## row_1 1 4 7
## row_2 2 5 8
## col_3 col_1
## row_1 7 1
## row_3 9 3
## col_1 col_3
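A sketch of calls that select rows and columns in this way. The matrix is the same as a above, built in one step; the specific index combinations are assumptions:

```r
a <- matrix(1:9, nrow = 3,
            dimnames = list(c("row_1", "row_2", "row_3"),
                            c("col_1", "col_2", "col_3")))

a[1:2, ]                          # first two rows, all columns (blank index)
a[c(1, 3), c(3, 1)]               # rows 1 and 3; columns 3 then 1
a[c("row_1", "row_3"), "col_2"]   # names work as well as positions
a[0, -2]                          # zero rows, every column except the second
```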
By default, [ will simplify results to the lowest possible dimensionality. For example, if we select only one row or one column of a matrix, the result will be a vector instead of a matrix.
## col_1 col_2 col_3
## 1 4 7
## [1] "integer"
## [1] "matrix" "array"
## col_1 col_2 col_3
## row_1 1 4 7
## [1] "matrix" "array"
## [1] 5
## col_2
## row_2 5
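A sketch of this simplification and of how drop = FALSE prevents it, using a 3 x 3 matrix like the one above (without dimnames, for brevity):

```r
a <- matrix(1:9, nrow = 3)

r1 <- a[1, ]                      # simplified to a plain integer vector
is.matrix(r1)                     # FALSE

r1_keep <- a[1, , drop = FALSE]   # stays a 1 x 3 matrix
is.matrix(r1_keep)                # TRUE
dim(r1_keep)                      # 1 3

a[2, 2]                           # 5: a single value
dim(a[2, 2, drop = FALSE])        # 1 1: a 1 x 1 matrix
```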
Because both matrices and arrays are just vectors with special attributes, you can subset them with a single vector, as if they were a 1D vector.
## col_1 col_2 col_3
## row_1 1 4 7
## row_2 2 5 8
## row_3 3 6 9
## [1] 1 5 9
## [1] 2 4 6 8
## col_1 col_2 col_3
## row_1 1 -1 7
## row_2 -1 5 -1
## row_3 3 -1 9
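A sketch of single-vector subsetting on a matrix. Elements are numbered down the columns (column-major order), so positions 1, 5, and 9 form the diagonal. The calls are assumptions matching the output above:

```r
a <- matrix(1:9, nrow = 3)

a[c(1, 5, 9)]             # 1 5 9: the diagonal
a[c(2, 4, 6, 8)] <- -1L   # assignment by position works the same way
a                         # the off-diagonal neighbours are now -1
```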
It is also possible to subset with a matrix. Each row in the index matrix specifies the location of one value, and each column corresponds to a dimension in the matrix or array being subset.
## [,1] [,2]
## [1,] 1 1
## [2,] 2 2
## [3,] 3 3
## [1] 1 5 9
## col_1 col_2 col_3
## row_1 1 4 7
## row_2 2 5 8
## row_3 3 6 9
## [,1] [,2] [,3]
## [1,] FALSE TRUE TRUE
## [2,] FALSE FALSE TRUE
## [3,] FALSE FALSE FALSE
## [1] 4 7 8
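A sketch of both forms of matrix index. The integer matrix idx (a hypothetical name) lists one (row, column) pair per row, and upper.tri() builds a logical matrix of the same shape as a:

```r
a <- matrix(1:9, nrow = 3)

idx <- cbind(1:3, 1:3)   # pairs (1,1), (2,2), (3,3)
a[idx]                   # 1 5 9

upper.tri(a)             # TRUE strictly above the diagonal
a[upper.tri(a)]          # 4 7 8
```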
Recall that a data frame has properties of both a list and a matrix. When we subset a data frame with a single vector, it acts like a list and returns the selected elements (columns).
## x y z
## 1 1 -2.1087062 a
## 2 2 0.6364000 b
## 3 3 0.3141892 c
## 4 4 0.5340274 d
## 5 5 -0.5669961 e
## x z
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## x z
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
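A sketch of list-style subsetting on a hypothetical data frame df. The y values here are fixed placeholders (an assumption), whereas the lecture's output appears to use random numbers:

```r
df <- data.frame(x = 1:5,
                 y = c(0.5, -1.2, 0.3, 2.0, -0.7),   # placeholder values
                 z = letters[1:5],
                 stringsAsFactors = FALSE)

df[c(1, 3)]        # columns 1 and 3, returned as a data frame
df[c("x", "z")]    # the same columns, selected by name
df["z"]            # a single column is still a data frame
```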
When we subset with two indices, a data frame acts like a matrix.
## y z
## 1 -2.108706 a
## 2 0.636400 b
## x y z
## 1 1 -2.1087062 a
## 2 2 0.6364000 b
## 3 3 0.3141892 c
## x y
## 1 1 -2.1087062
## 2 2 0.6364000
## 3 3 0.3141892
## 4 4 0.5340274
## 5 5 -0.5669961
## x y z
## 3 3 0.3141892 c
## x y z
## 2 2 0.6364000 b
## 4 4 0.5340274 d
## x y z
## 2 2 0.6364000 b
## 3 3 0.3141892 c
## 4 4 0.5340274 d
## x y z
## 3 3 0.3141892 c
## 'data.frame': 1 obs. of 3 variables:
## $ x: int 3
## $ y: num 0.314
## $ z: chr "c"
## [1] "c"
## chr "c"
## z
## 3 c
## 'data.frame': 1 obs. of 1 variable:
## $ z: chr "c"
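A sketch of matrix-style subsetting on the same kind of hypothetical data frame (placeholder y values are an assumption). Note how selecting a single column simplifies to a vector unless drop = FALSE is given:

```r
df <- data.frame(x = 1:5,
                 y = c(0.5, -1.2, 0.3, 2.0, -0.7),   # placeholder values
                 z = letters[1:5],
                 stringsAsFactors = FALSE)

df[1:2, c("y", "z")]            # rows 1-2, columns y and z
df[df$x == 3, ]                 # rows where a condition holds
df[df$x %in% c(2, 4), ]         # rows whose x is 2 or 4
df[df$x > 1 & df$x < 5, ]       # rows 2 to 4

df[3, ]                         # one row: still a data frame
df[3, "z"]                      # "c": simplified to a character vector
df[3, "z", drop = FALSE]        # a 1 x 1 data frame
```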
The default drop = TRUE behaviour is a common source of bugs in functions; prefer drop = FALSE when subsetting a 2-D object. Alternatively, try the tibble package, whose tibble class represents data frames and defaults to drop = FALSE.
@ and slot()
For S4 objects, we can use two operators to subset: @ (equivalent to $) and slot() (equivalent to [[). We won’t discuss them further in this course.
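For reference only, a minimal sketch with a hypothetical S4 class Person (the class and its slots are assumptions made up for illustration):

```r
library(methods)   # provides setClass(), new(), and slot()

# Hypothetical class with two slots
setClass("Person", representation(name = "character", age = "numeric"))
p <- new("Person", name = "Ada", age = 36)

p@name           # "Ada": access a slot by name, like $ for lists
slot(p, "age")   # 36: the slot name is given as a string, like [[
```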
Most of this lecture’s material is from Advanced R.