In today’s lecture, we will continue with R basics. Specifically, we will cover missing values, factors, dates and date-times, and subsetting.

Missing values NA

In R, missing values are represented with NA (short for “not available”). One main feature of NA is that it is infectious: most computations involving a missing value will return another missing value.

NA * 10
## [1] NA
NA + 5
## [1] NA

Exceptions exist when some identity holds for all possible inputs, so the result does not depend on the missing value:

NA ^ 0
## [1] 1
TRUE | NA
## [1] TRUE
FALSE & NA
## [1] FALSE

Because of this propagation, we need to be very careful when working with data that include NA. For example:

v_na = c(1, 2, 3, NA)
mean(v_na)
## [1] NA
mean(v_na, na.rm = TRUE) # many summary functions have an `na.rm` argument
## [1] 2
v_na == NA
## [1] NA NA NA NA
is.na(v_na)
## [1] FALSE FALSE FALSE  TRUE

There are four types of missing values: NA (logical), NA_integer_ (integer), NA_real_ (double), and NA_character_ (character). In most cases we don’t need to worry about this, because NA is automatically coerced to the correct type when needed.
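A minimal sketch of that coercion, with variable names made up for illustration. The bare NA is logical, but it takes on the type of the vector it is combined with. The typed constants mostly matter when a function insists on a specific type, as vapply() does here.

```r
# NA on its own is logical
na_type <- typeof(NA)

# Combined with other values, NA is coerced to the type of the vector
int_type <- typeof(c(1L, NA))
dbl_type <- typeof(c(1.5, NA))
chr_type <- typeof(c("a", NA))

# vapply() checks the type of every result, so a plain (logical) NA
# would be rejected here; NA_character_ matches the character template
typed <- vapply(
  c("x", ""),
  function(s) if (nzchar(s)) s else NA_character_,
  character(1),
  USE.NAMES = FALSE
)
typed
```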

S3 objects

An S3 object is a base type with at least a class attribute. A generic function can behave differently for different S3 objects. One example is str(), which produces different output depending on the class of its input (e.g., a vector vs. a data.frame).
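To make the idea concrete, here is a small sketch of S3 dispatch. The class name celsius and the generic describe() are hypothetical and exist only for this illustration.

```r
# A hypothetical S3 object: a double vector with a class attribute
temp <- structure(c(20.5, 23.1), class = "celsius")

# A hypothetical generic with one class-specific method and a fallback
describe <- function(x, ...) UseMethod("describe")
describe.celsius <- function(x, ...) paste(length(x), "temperatures in Celsius")
describe.default <- function(x, ...) paste("an object of type", typeof(x))

res_celsius <- describe(temp)  # dispatches to describe.celsius
res_default <- describe(1:3)   # no class-specific method, so describe.default runs
res_celsius
res_default
```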

Factor

A factor is a vector that can contain only predefined values. It is normally used to store categorical data (e.g., spring, summer, fall, winter). A factor is an integer vector with two attributes: a class (factor), which makes it behave differently from a regular integer vector, and levels, which define the set of allowed values.

v_f = factor(v_seasons <- c("spring", "summer", "fall", "winter"))
v_f
## [1] spring summer fall   winter
## Levels: fall spring summer winter
typeof(v_f)
## [1] "integer"
attributes(v_f)
## $levels
## [1] "fall"   "spring" "summer" "winter"
## 
## $class
## [1] "factor"
v_f2 = factor(c("spring", "fall"), levels = v_seasons)
v_f2
## [1] spring fall  
## Levels: spring summer fall winter
table(v_f2) # it will count all levels
## v_f2
## spring summer   fall winter 
##      1      0      1      0
v_f3 = ordered(c("spring", "winter", "fall"), levels = v_seasons)
v_f3
## [1] spring winter fall  
## Levels: spring < summer < fall < winter
v_f3[4] = "weather" # only predefined values allowed
## Warning in `[<-.factor`(`*tmp*`, 4, value = "weather"): invalid factor level, NA
## generated
v_f3
## [1] spring winter fall   <NA>  
## Levels: spring < summer < fall < winter

Note: factors are built on integer vectors, even though they look like character vectors. So it is usually best to convert factors to character vectors if you need to work with them as strings.

as.integer(v_f)
## [1] 2 3 1 4
as.integer(v_f3)
## [1]  1  4  3 NA
as.character(v_f)
## [1] "spring" "summer" "fall"   "winter"
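A common place where this matters is a factor whose labels look like numbers. The sketch below uses a made-up vector f_num: converting it directly to integer gives the level codes, while going through character first recovers the original values.

```r
# A factor whose labels happen to be digits; levels are sorted as strings
f_num <- factor(c("10", "5", "20"))
levels(f_num)

codes <- as.integer(f_num)                  # the underlying integer codes
values <- as.numeric(as.character(f_num))   # the numbers the labels represent
codes
values
```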

Dates

Date vectors are built on top of double vectors, with a Date class.

today <- Sys.Date()
typeof(today)
## [1] "double"
attributes(today)
## $class
## [1] "Date"

The value represents the number of days since 1970-01-01.

(date <- as.Date("1970-01-02"))
## [1] "1970-01-02"
unclass(date)
## [1] 1
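Because a Date is just a number of days, ordinary arithmetic works on it. A short sketch with two arbitrary example dates:

```r
d1 <- as.Date("2021-09-28")
d2 <- as.Date("2021-10-05")

one_week_later <- d1 + 7   # adding a number adds that many days
gap <- d2 - d1             # subtracting two dates gives a difftime
gap_days <- as.numeric(gap)
days_since_epoch <- unclass(d1)  # the underlying number of days since 1970-01-01

one_week_later
gap_days
days_since_epoch
```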

Date-times

Base R stores date-time information in two ways: POSIXct and POSIXlt. “POSIX” is short for Portable Operating System Interface, which is a family of cross-platform standards. “ct” stands for calendar time, and “lt” for local time.

POSIXct vectors are built on top of double vectors, where the value represents the number of seconds since 1970-01-01.

now_ct <- as.POSIXct("2021-09-28 09:20", tz = "US/Central")
now_ct
## [1] "2021-09-28 09:20:00 CDT"
typeof(now_ct)
## [1] "double"
attributes(now_ct)
## $class
## [1] "POSIXct" "POSIXt" 
## 
## $tzone
## [1] "US/Central"
unclass(now_ct)
## [1] 1632838800
## attr(,"tzone")
## [1] "US/Central"
structure(now_ct, tzone = "Asia/Shanghai")
## [1] "2021-09-28 22:20:00 CST"
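For comparison, here is a brief sketch of the other representation, POSIXlt, which is built on a list of components rather than a single number. The time zone is set to UTC here only to keep the example independent of the local machine.

```r
now_lt <- as.POSIXlt("2021-09-28 09:20", tz = "UTC")

lt_type <- typeof(now_lt)      # a list, not a double
lt_hour <- now_lt$hour         # components are accessed like list elements
lt_min <- now_lt$min
lt_month <- now_lt$mon + 1     # months are stored as 0-11
lt_year <- now_lt$year + 1900  # years are stored as years since 1900

# Converting to POSIXct gives back a single number of seconds since 1970-01-01
secs <- as.numeric(as.POSIXct(now_lt))
secs
```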

The R package lubridate can make dealing with most date-time data easy. Make sure to install it and check it out.

Subsetting

So far, we have covered most data structures in R (use str() to inspect the structure of an object). Now we move on to how to access specific elements of each common data structure.

Atomic vectors

We can use [ to select any number of elements from a vector. There are six ways to do so.

Positive integers

x = c(1, 2, 3, 4, 5)
x[c(1, 5)]
## [1] 1 5
# Duplicate indices will duplicate values
x[c(1, 1)]
## [1] 1 1
# Real numbers are silently truncated to integers
x[c(2.1, 3.9)]
## [1] 2 3

Negative integers exclude elements at the specified positions

x[-c(3, 1)]
## [1] 2 4 5
x[c(-1, -3)]
## [1] 2 4 5
x[c(-1, 3)]
# Error in x[c(-1, 3)] : only 0's may be mixed with negative subscripts

Logical vectors select elements where the corresponding logical value is TRUE.

x[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
## [1] 1 3 5
x[c(TRUE, FALSE)] # recycle
## [1] 1 3 5
# NA in, NA out
x[c(TRUE, FALSE, TRUE, NA, TRUE)]
## [1]  1  3 NA  5
x[c(TRUE, NA)] 
## [1]  1 NA  3 NA  5
x[x %% 2 != 0] # Modulus (x mod 2)
## [1] 1 3 5

Nothing returns the original vector

x[]
## [1] 1 2 3 4 5

Zero returns a zero-length vector

x[0]
## numeric(0)

Character vectors return elements with matching names, if the vector is named

names(x) = letters[1:5]
x
## a b c d e 
## 1 2 3 4 5
x[c("c", "a", "b")]
## c a b 
## 3 1 2
x[c("a", "a", "a")]
## a a a 
## 1 1 1
x[factor("b")] # get what??
x[factor("e")] # get what??
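A hint for the two factor examples: subsetting does not treat factors specially, so the underlying integer codes are used rather than the labels. The sketch below uses a separate vector, x_named, so it does not depend on the code above.

```r
x_named <- c(a = 1, b = 2, c = 3, d = 4, e = 5)

# factor("b") has a single level, so its integer code is 1
code_b <- as.integer(factor("b"))

by_factor <- x_named[factor("b")]               # same as x_named[1]
by_label <- x_named[as.character(factor("b"))]  # matches on the name "b"
by_factor
by_label
```

Converting the factor to character first is the safe way to index by its labels.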

Lists

Using [ to select elements of a list always returns a list. [[ and $ extract elements from a list: [[ extracts a single item, while x$y is a useful shorthand for x[["y"]].

x_l = list(c(1:5), letters[1:3], c(TRUE, FALSE))
x_l[c(1, 3)]
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1]  TRUE FALSE
x_l[2]
## [[1]]
## [1] "a" "b" "c"
x_l[[2]]
## [1] "a" "b" "c"
x_l[[c(2, 3)]]
## [1] "c"

When a list is named, we can use $.

names(x_l) = c("A", "B", "C")
x_l
## $A
## [1] 1 2 3 4 5
## 
## $B
## [1] "a" "b" "c"
## 
## $C
## [1]  TRUE FALSE
x_l[["B"]]
## [1] "a" "b" "c"
x_l$B
## [1] "a" "b" "c"

If list x is a train carrying objects, then x[[5]] is the object in car 5; x[4:6] is a train of cars 4-6. — @RLangTip, https://twitter.com/RLangTip/status/268375867468681216

x = list(1:3, "a", 4:6)
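The train metaphor can be checked directly on a list like this one. In the sketch below the list is called train (a name chosen for this illustration) so that it does not overwrite x.

```r
train <- list(1:3, "a", 4:6)

cars_1_2 <- train[1:2]  # a shorter train: still a list, with two cars
one_car <- train[1]     # a train with a single car: a list of length 1
cargo <- train[[1]]     # the contents of car 1: the integer vector itself

str(cars_1_2)
str(one_car)
str(cargo)
```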

Matrix and array

You can subset higher-dimensional structures in three ways:

  • With multiple vectors.
  • With a single vector.
  • With a matrix.

The most common way of subsetting matrices (2D) and arrays (>2D) is a simple generalisation of 1D subsetting: supply a 1D index for each dimension, separated by a comma.

Blank subsetting is now useful because it lets you keep all rows or all columns.

a <- matrix(1:9, nrow = 3)
colnames(a) <- c("col_1", "col_2", "col_3")
rownames(a) <- c("row_1", "row_2", "row_3")
a
##       col_1 col_2 col_3
## row_1     1     4     7
## row_2     2     5     8
## row_3     3     6     9
a[1:2, ]
##       col_1 col_2 col_3
## row_1     1     4     7
## row_2     2     5     8
a[c(TRUE, FALSE, TRUE), c("col_3", "col_1")]
##       col_3 col_1
## row_1     7     1
## row_3     9     3
a[0, -2]
##      col_1 col_3
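The same rule extends to arrays with more than two dimensions: one index per dimension. A small sketch with a made-up 2 x 3 x 4 array called arr:

```r
# Values 1..24 are filled in column-major order
arr <- array(1:24, dim = c(2, 3, 4))

single <- arr[1, 2, 3]  # one element: row 1, column 2, third slice
slice <- arr[, , 1]     # blank indices keep all rows and columns of slice 1
row_1 <- arr[1, , ]     # row 1 across all columns and slices

single
dim(slice)
dim(row_1)
```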

By default, [ will simplify results to the lowest possible dimensionality. For example, if we only select a row or a column of a matrix, then the results will be a vector instead of a matrix.

a[1,]
## col_1 col_2 col_3 
##     1     4     7
class(a[1,])
## [1] "integer"
class(a)
## [1] "matrix" "array"
a[1, , drop = FALSE]
##       col_1 col_2 col_3
## row_1     1     4     7
class(a[1, , drop = FALSE])
## [1] "matrix" "array"
a[2, 2]
## [1] 5
a[2, 2, drop = FALSE]
##       col_2
## row_2     5

Because both matrices and arrays are just vectors with special attributes, you can subset them with a single vector, as if they were a 1D vector.

a[]
##       col_1 col_2 col_3
## row_1     1     4     7
## row_2     2     5     8
## row_3     3     6     9
a[c(1, 5, 9)]
## [1] 1 5 9
a[a %% 2 == 0]
## [1] 2 4 6 8
a2 = a
a2[a2 %% 2 == 0] = -1
a2
##       col_1 col_2 col_3
## row_1     1    -1     7
## row_2    -1     5    -1
## row_3     3    -1     9

It is also possible to subset with a matrix. Each row in the index matrix specifies the location of one value, and each column corresponds to a dimension of the matrix or array being subset.

select_mat <- matrix(ncol = 2, byrow = TRUE, c(
  1, 1,
  2, 2,
  3, 3
))
select_mat
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
a[select_mat]
## [1] 1 5 9
a
##       col_1 col_2 col_3
## row_1     1     4     7
## row_2     2     5     8
## row_3     3     6     9
upper.tri(a)
##       [,1]  [,2]  [,3]
## [1,] FALSE  TRUE  TRUE
## [2,] FALSE FALSE  TRUE
## [3,] FALSE FALSE FALSE
a[upper.tri(a)]
## [1] 4 7 8

Data frame

Recall that a data frame has properties of both a list and a matrix. When you subset a data frame with a single vector, it behaves like a list and returns the selected elements (columns).

(df = data.frame(x = 1:5, y = rnorm(5), z = letters[1:5]))
##   x          y z
## 1 1 -2.1087062 a
## 2 2  0.6364000 b
## 3 3  0.3141892 c
## 4 4  0.5340274 d
## 5 5 -0.5669961 e
df[c(1, 3)]
##   x z
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
df[c("x", "z")]
##   x z
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
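The list-like behaviour also applies to [[ and $, which extract a single column as a vector. A brief sketch on a small made-up data frame, df_demo:

```r
df_demo <- data.frame(x = 1:3, z = c("a", "b", "c"))

col_df <- df_demo["x"]        # [ keeps the container: a one-column data frame
col_vec <- df_demo[["x"]]     # [[ extracts the column itself: an integer vector
col_dollar <- df_demo$x       # $ is shorthand for [["x"]]

str(col_df)
str(col_vec)
```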

When subset with two indices, a data frame behaves like a matrix.

df[1:2, 2:3]
##           y z
## 1 -2.108706 a
## 2  0.636400 b
df[1:3,]
##   x          y z
## 1 1 -2.1087062 a
## 2 2  0.6364000 b
## 3 3  0.3141892 c
df[, 1:2]
##   x          y
## 1 1 -2.1087062
## 2 2  0.6364000
## 3 3  0.3141892
## 4 4  0.5340274
## 5 5 -0.5669961
df[df$x == 3,]
##   x         y z
## 3 3 0.3141892 c
df[df$x %% 2 == 0,]
##   x         y z
## 2 2 0.6364000 b
## 4 4 0.5340274 d
df[df$x %% 2 == 0 | df$x == 3,]
##   x         y z
## 2 2 0.6364000 b
## 3 3 0.3141892 c
## 4 4 0.5340274 d
df[3,]
##   x         y z
## 3 3 0.3141892 c
str(df[3,])
## 'data.frame':    1 obs. of  3 variables:
##  $ x: int 3
##  $ y: num 0.314
##  $ z: chr "c"
df[3, 3]
## [1] "c"
str(df[3, 3])
##  chr "c"
df[3, 3, drop = FALSE]
##   z
## 3 c
str(df[3, 3, drop = FALSE])
## 'data.frame':    1 obs. of  1 variable:
##  $ z: chr "c"

The default drop = TRUE behaviour is a common source of bugs in functions; try to use drop = FALSE when subsetting a 2-D object. Alternatively, try the tibble package: its tibble class represents a data frame whose [ defaults to drop = FALSE and always returns another tibble.
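Here is a sketch of how such a bug can arise. Both helper functions are hypothetical; they differ only in whether drop = FALSE is used.

```r
# Hypothetical helpers: column means of the selected columns
col_means_fragile <- function(d, cols) colMeans(d[, cols])
col_means_safe <- function(d, cols) colMeans(d[, cols, drop = FALSE])

d <- data.frame(u = c(1, 2, 3), v = c(4, 5, 6))

# With two columns, both versions receive a data frame and work
two_cols <- col_means_fragile(d, c("u", "v"))

# With one column, d[, "u"] is simplified to a plain vector and colMeans() fails
fragile_result <- tryCatch(
  col_means_fragile(d, "u"),
  error = function(e) "error"
)

# drop = FALSE keeps the data frame, so the same call succeeds
safe_result <- col_means_safe(d, "u")

fragile_result
safe_result
```

The fragile version passes every test that selects two or more columns, which is why this kind of bug tends to surface late.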

S4 object and @ and slot()

For S4 objects, we can use two operators to subset: @ (equivalent to $) and slot() (equivalent to [[). We won’t talk more about it in this course.
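For completeness, a minimal sketch of both operators. The class Person and its slots are hypothetical and defined only for this example.

```r
library(methods)  # attached by default in most sessions; loaded here to be safe

# A hypothetical S4 class with two slots
setClass("Person", representation(name = "character", age = "numeric"))
p <- new("Person", name = "Ada", age = 36)

p_name <- p@name          # @ accesses a slot by name
p_age <- slot(p, "age")   # slot() takes the slot name as a string

# Both forms can also be used for assignment
slot(p, "age") <- 37
p@name <- "Ada L."

p_name
p_age
```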

Reference

Most of this lecture’s material is from Advanced R.