class: center, middle, inverse, title-slide .title[ # R Basics ] .subtitle[ ## Introduction to Data Science (BIOL7800)
https://introdatasci.dlilab.com/
] .author[ ### Daijiang Li ] .institute[ ### LSU ] .date[ ### 2023/09/12 ] --- # Data types ### The first step in any data analysis is to choose the structure and to create a dataset to hold the data ### R has a wide variety of structures for holding data, including scalars, vectors, arrays, data frames, and lists. --- class: middle # Data structures .pull-left[ .font160[ |Dimensions |Homogeneous |Heterogeneous | |:----------|:---------------------|:---------------| |1d |.cyan[Vector (atomic)] |List (generic) | |2d |.cyan[Matrix] |.cyan[Data frame] | |nd |Array |NA | ] #### Almost all other objects are build upon these foundations. #### `str()` to understand data structure ] .pull-right[ ![data structure in R](figs/data_structure.png) ] --- # Vector Vector types: .cyan[logical], .cyan[double], .cyan[integer]<sup>1</sup> , .cyan[character], complex (imaginary numbers), and raw (bytes) Go-to function for making vectors: `c()` ```r (a <- c(1:3)) # equal to: a <- c(1:3); a ``` ``` ## [1] 1 2 3 ``` ```r (b <- c(4:6)) ``` ``` ## [1] 4 5 6 ``` ```r (C <- c(a, b)) # don't name it as c! ``` ``` ## [1] 1 2 3 4 5 6 ``` .footnote[ [1] `double` and `integer` are both `numeric` ] --- # Vector .font200[ Vectors have three common properties: - Type (what it is), `typeof()` - Length (how many elements), `length()` - Attributes (additional arbitrary metadata) `attributes()` ] ```r typeof(a) ``` ``` ## [1] "integer" ``` ```r length(a) ``` ``` ## [1] 3 ``` ```r attributes(a) ``` ``` ## NULL ``` --- # Vector .pull-left[ ```r (v_dbl = c(1, 3.1)) ``` ``` ## [1] 1.0 3.1 ``` ```r (v_int = c(0L:3L)) # colon operator ``` ``` ## [1] 0 1 2 3 ``` ```r (v_log = c(TRUE, FALSE)) # T, F ``` ``` ## [1] TRUE FALSE ``` ```r (v_chr = c("a", "word")) ``` ``` ## [1] "a" "word" ``` ] -- .pull-right[ ```r typeof(v_dbl) ``` ``` ## [1] "double" ``` ```r is.double(v_dbl) ``` ``` ## [1] TRUE ``` ```r is.numeric(v_int) ``` ``` ## [1] TRUE ``` ```r is.integer(v_int) ``` ``` ## [1] TRUE ``` ```r is.atomic(v_log) ``` ``` ## [1] TRUE ``` ] --- # Coercion Vector only allow **one** type of elements; so when mix different types of elements, they will be coerced to the most flexible type (**least to most flexible: logical, integer, double, character**) .pull-left[ ```r c(v_log, v_int) ``` ``` ## [1] 1 0 0 1 2 3 ``` ```r c(v_log, v_chr) ``` ``` ## [1] "TRUE" "FALSE" "a" "word" ``` ```r c(v_dbl, v_int) ``` ``` ## [1] 1.0 3.1 0.0 1.0 2.0 3.0 ``` ```r c(v_dbl, v_chr) ``` ``` ## [1] "1" "3.1" "a" "word" ``` ] .pull-right[ ```r typeof(c(v_log, v_int)) ``` ``` ## [1] "integer" ``` ```r typeof(c(v_log, v_chr)) ``` ``` ## [1] "character" ``` ```r typeof(c(v_dbl, v_int)) ``` ``` ## [1] "double" ``` ```r typeof(c(v_dbl, v_chr)) ``` ``` ## [1] "character" ``` ] --- # Coercion and math functions ### Coercion often happens automatically ```r v_log2 = c(TRUE, FALSE, TRUE, TRUE, FALSE) sum(v_log2) ``` ``` ## [1] 3 ``` ```r mean(v_log2) ``` ``` ## [1] 0.6 ``` --- class: inverse, middle ## How do you get the number of positive values in the vector below using the coercion example in the previous slide? ```r v_norm = rnorm(n = 1000, mean = 0, sd = 2) head(v_norm, n = 10) ``` ``` ## [1] -1.10087437 1.28091634 1.51635608 -2.67190898 2.76741063 1.37167187 ## [7] -3.43942011 -1.20752287 0.07888408 -0.81300381 ``` ??? take a minute to discuss with others --- ## Coercion on purpose ```r as.integer(v_log2) ``` ``` ## [1] 1 0 1 1 0 ``` ```r as.character(v_dbl) ``` ``` ## [1] "1" "3.1" ``` ```r as.logical(v_int) ``` ``` ## [1] FALSE TRUE TRUE TRUE ``` ```r as.numeric(v_log2) ``` ``` ## [1] 1 0 1 1 0 ``` ```r as.numeric(v_chr) ``` ``` ## Warning: NAs introduced by coercion ``` ``` ## [1] NA NA ``` --- # Vector names .pull-left[ ### Three ways to add names ```r (v1 = c(a = 1, b = 2)) # 1 ``` ``` ## a b ## 1 2 ``` ```r v2 = 1:2 names(v2) = c("a", "b") # 2 v2 ``` ``` ## a b ## 1 2 ``` ```r setNames(1:2, c("a", "b")) # 3 ``` ``` ## a b ## 1 2 ``` ] -- .pull-right[ ### Remove names ```r unname(v1) ``` ``` ## [1] 1 2 ``` ```r names(v2) = NULL v2 ``` ``` ## [1] 1 2 ``` ] --- # Lists ### Lists are different from atomic vectors above because their elements can be of any type, including lists (thus they are **recursive** vectors) ```r x = list(1:3, "a", c(TRUE, FALSE), list(2:1, "b")) str(x) ``` ``` ## List of 4 ## $ : int [1:3] 1 2 3 ## $ : chr "a" ## $ : logi [1:2] TRUE FALSE ## $ :List of 2 ## ..$ : int [1:2] 2 1 ## ..$ : chr "b" ``` ```r is.recursive(x) ``` ``` ## [1] TRUE ``` --- # Lists .pull-left[ ```r l1 = list(list(1, 2), c(3, 4)) str(l1) ``` ``` ## List of 2 ## $ :List of 2 ## ..$ : num 1 ## ..$ : num 2 ## $ : num [1:2] 3 4 ``` ] .pull-right[ ```r l2 = c(list(1, 2), c(3, 4)) str(l2) ``` ``` ## List of 4 ## $ : num 1 ## $ : num 2 ## $ : num 3 ## $ : num 4 ``` ] -- ```r typeof(l1) ``` ``` ## [1] "list" ``` ```r unlist(l1) # back to atomic vector ``` ``` ## [1] 1 2 3 4 ``` --- # List names ```r names(l2) ``` ``` ## NULL ``` ```r names(l2) = c("name_1", "name_2") str(l2) ``` ``` ## List of 4 ## $ name_1: num 1 ## $ name_2: num 2 ## $ NA : num 3 ## $ NA : num 4 ``` ```r l3 = list(lst_a = c(1:5), lst_b = letters[1:3], LETTERS[1:3]) str(l3) ``` ``` ## List of 3 ## $ lst_a: int [1:5] 1 2 3 4 5 ## $ lst_b: chr [1:3] "a" "b" "c" ## $ : chr [1:3] "A" "B" "C" ``` ```r names(l3) ``` ``` ## [1] "lst_a" "lst_b" "" ``` --- layout: true # Matrix --- .pull-left[ ```r matrix(data = 0, nrow = 3, ncol = 3) ``` ``` ## [,1] [,2] [,3] ## [1,] 0 0 0 ## [2,] 0 0 0 ## [3,] 0 0 0 ``` ```r matrix(data = 1:9, nrow = 3, ncol = 3) ``` ``` ## [,1] [,2] [,3] ## [1,] 1 4 7 ## [2,] 2 5 8 ## [3,] 3 6 9 ``` ] -- .pull-right[ ```r matrix(data = letters[1:9], nrow = 3, ncol = 3) ``` ``` ## [,1] [,2] [,3] ## [1,] "a" "d" "g" ## [2,] "b" "e" "h" ## [3,] "c" "f" "i" ``` ```r matrix(data = LETTERS[1:9], nrow = 3, ncol = 3) ``` ``` ## [,1] [,2] [,3] ## [1,] "A" "D" "G" ## [2,] "B" "E" "H" ## [3,] "C" "F" "I" ``` ] --- ```r mat_a <- matrix(data = 1:9, nrow = 3, ncol = 3, * byrow = TRUE ) mat_a ``` ``` ## [,1] [,2] [,3] ## [1,] 1 2 3 ## [2,] 4 5 6 ## [3,] 7 8 9 ``` ```r rownames(mat_a) <- c("row1", "row2", "row3") colnames(mat_a) <- c("col1", "col2", "col3") mat_a ``` ``` ## col1 col2 col3 ## row1 1 2 3 ## row2 4 5 6 ## row3 7 8 9 ``` --- ### Coercion ```r mat_b <- mat_a mat_b[9] = "n9" mat_b ``` ``` ## col1 col2 col3 ## row1 "1" "2" "3" ## row2 "4" "5" "6" ## row3 "7" "8" "n9" ``` ```r class(mat_b) ``` ``` ## [1] "matrix" "array" ``` ```r typeof(mat_b) ``` ``` ## [1] "character" ``` ??? matrix also has type conversion --- .pull-left[ ```r upper.tri(mat_a, diag = FALSE) ``` ``` ## [,1] [,2] [,3] ## [1,] FALSE TRUE TRUE ## [2,] FALSE FALSE TRUE ## [3,] FALSE FALSE FALSE ``` ```r mat_a ``` ``` ## col1 col2 col3 ## row1 1 2 3 ## row2 4 5 6 ## row3 7 8 9 ``` ] .pull-right[ ```r (idx = lower.tri(mat_a, diag = TRUE)) ``` ``` ## [,1] [,2] [,3] ## [1,] TRUE FALSE FALSE ## [2,] TRUE TRUE FALSE ## [3,] TRUE TRUE TRUE ``` ```r mat_a[idx] ``` ``` ## [1] 1 4 7 5 8 9 ``` ] --- layout: false # Arrays .pull-left[ ```r a = array(data = 1:12, dim = c(2, 3, 2)) a ``` ``` ## , , 1 ## ## [,1] [,2] [,3] ## [1,] 1 3 5 ## [2,] 2 4 6 ## ## , , 2 ## ## [,1] [,2] [,3] ## [1,] 7 9 11 ## [2,] 8 10 12 ``` ] .pull-right[ ```r length(a) ``` ``` ## [1] 12 ``` ```r dim(a) ``` ``` ## [1] 2 3 2 ``` ```r str(a) ``` ``` ## int [1:2, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ... ``` ```r class(a) ``` ``` ## [1] "array" ``` ```r typeof(a) ``` ``` ## [1] "integer" ``` ] --- # Arrays: dimension names .pull-left[ ```r dimnames(a) = list(c("R1", "R2"), c("C1", "C2", "C3"), c("A", "B")) a ``` ``` ## , , A ## ## C1 C2 C3 ## R1 1 3 5 ## R2 2 4 6 ## ## , , B ## ## C1 C2 C3 ## R1 7 9 11 ## R2 8 10 12 ``` ] -- .pull-right[ ```r a2 = array(data = 1:12, dim = c(2, 3, 2), dimnames = list(c("R1", "R2"), c("C1", "C2", "C3"), c("A", "B"))) a2 ``` ``` ## , , A ## ## C1 C2 C3 ## R1 1 3 5 ## R2 2 4 6 ## ## , , B ## ## C1 C2 C3 ## R1 7 9 11 ## R2 8 10 12 ``` ] --- class: inverse, middle # How the three objects below are different from vector 1:5? ```r x1 = array(1:5, c(1, 1, 5)) x2 = array(1:5, c(1, 5, 1)) x3 = array(1:5, c(5, 1, 1)) ``` --- # Data frames ### A data frame is more general than a matrix in that different columns can be different modes of data; **it will be the most common data structure we'll deal with in R**. .pull-left[ ```r d = data.frame(v_dbl, v_log, v_chr) d ``` ``` ## v_dbl v_log v_chr ## 1 1.0 TRUE a ## 2 3.1 FALSE word ``` ] .pull-right[ ```r str(d) ``` ``` ## 'data.frame': 2 obs. of 3 variables: ## $ v_dbl: num 1 3.1 ## $ v_log: logi TRUE FALSE ## $ v_chr: chr "a" "word" ``` ```r length(d) ``` ``` ## [1] 3 ``` ] --- # Data frames ### A data frame is just **a list of equal-length vectors**; therefore it shares properties of both matrix and list .pull-left[ ```r d ``` ``` ## v_dbl v_log v_chr ## 1 1.0 TRUE a ## 2 3.1 FALSE word ``` ```r # a list of equal length vector typeof(d) ``` ``` ## [1] "list" ``` ```r class(d) ``` ``` ## [1] "data.frame" ``` ```r is.data.frame(d) ``` ``` ## [1] TRUE ``` ] .pull-right[ ```r names(d) ``` ``` ## [1] "v_dbl" "v_log" "v_chr" ``` ```r colnames(d) ``` ``` ## [1] "v_dbl" "v_log" "v_chr" ``` ```r rownames(d) ``` ``` ## [1] "1" "2" ``` ] --- # as.data.frame() ```r as.data.frame(c(1:2)) ``` ``` ## c(1:2) ## 1 1 ## 2 2 ``` ```r as.data.frame(mat_a) ``` ``` ## col1 col2 col3 ## row1 1 2 3 ## row2 4 5 6 ## row3 7 8 9 ``` ```r as.data.frame(l2) ``` ``` ## name_1 name_2 NA. NA..1 ## 1 1 2 3 4 ``` --- # Combine data frames .pull-left[ ### stack data frames ```r d_row = data.frame(1, 2, "3") names(d_row) = names(d) rbind(d, d_row) ``` ``` ## v_dbl v_log v_chr ## 1 1.0 1 a ## 2 3.1 0 word ## 3 1.0 2 3 ``` ```r dplyr::bind_rows(d, d_row) ``` ``` ## v_dbl v_log v_chr ## 1 1.0 1 a ## 2 3.1 0 word ## 3 1.0 2 3 ``` ] .pull-right[ ### data frames side by side ```r d_col = data.frame(x1 = 1:2) cbind(d, d_col) ``` ``` ## v_dbl v_log v_chr x1 ## 1 1.0 TRUE a 1 ## 2 3.1 FALSE word 2 ``` ```r dplyr::bind_cols(d, d_col) ``` ``` ## v_dbl v_log v_chr x1 ## 1 1.0 TRUE a 1 ## 2 3.1 FALSE word 2 ``` ]