Data input

We can get data into R from the keyboard, from the clipboard, or from an external file (local or online).

From keyboard

We have already used the c() function to combine values into a vector.

We can also use the scan() function if we want to type or paste a few numbers into a vector from the keyboard.

Demo
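A minimal sketch of scan(). Called with no arguments, it reads from the keyboard until an empty line is entered. The text argument used below is only a stand-in for typed input so that the example runs non-interactively, and the numbers are made up.

```r
# Interactive use (not run here): type numbers at the 1: prompt,
# then press Enter on an empty line to finish.
# x <- scan()

# Non-interactive stand-in: the text argument supplies the "typed" input.
x <- scan(text = "3 5 7 11", quiet = TRUE)
x
length(x)
```

By default scan() expects numbers; for text, what = character() is needed.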

From clipboard

You can also use scan() to paste groups of numbers from the clipboard. In Excel, highlight the column of numbers you want and press Ctrl+C. Back in R, call scan(), press Ctrl+V at the 1: prompt, and then press Enter on an empty line; the numbers are read into a vector.

Demo

# windows
z <- scan("clipboard", what = numeric())
# for text, use what = character(); add sep = "," for comma-separated values
# macOS
z <- scan(pipe("pbpaste"), what = numeric())

From external files

If you use Excel, try to export the data as a plain-text file (e.g., .csv or .tsv). You may not be able to open a 10-year-old Excel file, but plain-text files will always work.

When writing code to read files, always use relative paths instead of absolute paths, so that the code still runs when the project folder is moved to another computer.
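As a rough illustration of the relative-path advice, the sketch below builds a path from its parts with file.path(). The project layout and file names are hypothetical.

```r
# Hypothetical project layout:
#   my_project/
#     data/view.csv
#     code/analysis.R
# With the working directory set to my_project/, build the path from parts.
csv_path <- file.path("data", "view.csv")
csv_path

# An absolute path such as "C:/Users/me/my_project/data/view.csv"
# only works on one machine; the relative one works wherever the
# project folder is copied.
getwd()  # relative paths are resolved against this directory
```

file.path() joins the parts with "/", which R accepts on Windows, macOS, and Linux.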

Base R functions

R ships with built-in datasets (in the datasets package); data() lists the datasets that are available.

data()
# read.table()
args(read.table)
## function (file, header = FALSE, sep = "", quote = "\"'", dec = ".", 
##     numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, 
##     col.names, as.is = !stringsAsFactors, tryLogical = TRUE, 
##     na.strings = "NA", colClasses = NA, nrows = -1, skip = 0, 
##     check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE, 
##     blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, 
##     flush = FALSE, stringsAsFactors = FALSE, fileEncoding = "", 
##     encoding = "unknown", text, skipNul = FALSE) 
## NULL
read.table(file = "../02_proj_cycle/view.csv", header = TRUE, sep = ",")
##     X                                                       names   views
## 1   1 An Interview with Gilbert Strang on Teaching Linear Algebra  531657
## 2   2                         1. The Geometry of Linear Equations  749756
## 3   3                               2. Elimination with Matrices. 1651140
## 4   4                      3. Multiplication and Inverse Matrices 1149974
## 5   5                                4. Factorization into A = LU  431759
## 6   6                     5. Transposes, Permutations, Spaces R^n  669672
## 7   7                               6. Column Space and Nullspace  633672
## 8   8       7. Solving Ax = 0: Pivot Variables, Special Solutions  512615
## 9   9                       8. Solving Ax = b: Row Reduced Form R  463494
## 10 10                       9. Independence, Basis, and Dimension  486194
## 11 11                          10. The Four Fundamental Subspaces  450159
## 12 12               11. Matrix Spaces; Rank 1; Small World Graphs  339305
## 13 13                    12. Graphs, Networks, Incidence Matrices  283376
## 14 14                                           13. Quiz 1 Review  246720
## 15 15                        14. Orthogonal Vectors and Subspaces  368634
## 16 16                              15. Projections onto Subspaces  361465
## 17 17                   16. Projection Matrices and Least Squares  328095
## 18 18                    17. Orthogonal Matrices and Gram-Schmidt   87774
## 19 19                              18. Properties of Determinants  290485
## 20 20                      19. Determinant Formulas and Cofactors  261363
## 21 21               20. Cramer's Rule, Inverse Matrix, and Volume  253171
## 22 22                            21. Eigenvalues and Eigenvectors  269982
## 23 23                         22. Diagonalization and Powers of A  351739
## 24 24                      23. Differential Equations and exp(At)  261975
## 25 25                         24. Markov Matrices; Fourier Series   49059
## 26 26                                          24b. Quiz 2 Review   17764
## 27 27            25. Symmetric Matrices and Positive Definiteness   56173
## 28 28                26. Complex Matrices; Fast Fourier Transform  189655
## 29 29                   27. Positive Definite Matrices and Minima  184409
## 30 30                        28. Similar Matrices and Jordan Form   50385
## 31 31                            29. Singular Value Decomposition   58854
## 32 32               30. Linear Transformations and Their Matrices  291459
## 33 33                      31. Change of Basis; Image Compression   37131
## 34 34                                           32. Quiz 3 Review  113571
## 35 35                  33. Left and Right Inverses; Pseudoinverse  164589
## 36 36                                     34. Final Course Review  150043
read.table(file = "../02_proj_cycle/view.csv", header = TRUE, sep = ",", row.names = 1)
##                                                          names   views
## 1  An Interview with Gilbert Strang on Teaching Linear Algebra  531657
## 2                          1. The Geometry of Linear Equations  749756
## 3                                2. Elimination with Matrices. 1651140
## 4                       3. Multiplication and Inverse Matrices 1149974
## 5                                 4. Factorization into A = LU  431759
## 6                      5. Transposes, Permutations, Spaces R^n  669672
## 7                                6. Column Space and Nullspace  633672
## 8        7. Solving Ax = 0: Pivot Variables, Special Solutions  512615
## 9                        8. Solving Ax = b: Row Reduced Form R  463494
## 10                       9. Independence, Basis, and Dimension  486194
## 11                          10. The Four Fundamental Subspaces  450159
## 12               11. Matrix Spaces; Rank 1; Small World Graphs  339305
## 13                    12. Graphs, Networks, Incidence Matrices  283376
## 14                                           13. Quiz 1 Review  246720
## 15                        14. Orthogonal Vectors and Subspaces  368634
## 16                              15. Projections onto Subspaces  361465
## 17                   16. Projection Matrices and Least Squares  328095
## 18                    17. Orthogonal Matrices and Gram-Schmidt   87774
## 19                              18. Properties of Determinants  290485
## 20                      19. Determinant Formulas and Cofactors  261363
## 21               20. Cramer's Rule, Inverse Matrix, and Volume  253171
## 22                            21. Eigenvalues and Eigenvectors  269982
## 23                         22. Diagonalization and Powers of A  351739
## 24                      23. Differential Equations and exp(At)  261975
## 25                         24. Markov Matrices; Fourier Series   49059
## 26                                          24b. Quiz 2 Review   17764
## 27            25. Symmetric Matrices and Positive Definiteness   56173
## 28                26. Complex Matrices; Fast Fourier Transform  189655
## 29                   27. Positive Definite Matrices and Minima  184409
## 30                        28. Similar Matrices and Jordan Form   50385
## 31                            29. Singular Value Decomposition   58854
## 32               30. Linear Transformations and Their Matrices  291459
## 33                      31. Change of Basis; Image Compression   37131
## 34                                           32. Quiz 3 Review  113571
## 35                  33. Left and Right Inverses; Pseudoinverse  164589
## 36                                     34. Final Course Review  150043
# read.delim()
args(read.delim)
## function (file, header = TRUE, sep = "\t", quote = "\"", dec = ".", 
##     fill = TRUE, comment.char = "", ...) 
## NULL
# read.csv()
args(read.csv)
## function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", 
##     fill = TRUE, comment.char = "", ...) 
## NULL
# read.table(file = file.choose()) # not recommended

# readLines()
args(readLines)
## function (con = stdin(), n = -1L, ok = TRUE, warn = TRUE, encoding = "unknown", 
##     skipNul = FALSE) 
## NULL
# load()
args(load) # .RData
## function (file, envir = parent.frame(), verbose = FALSE) 
## NULL
# x <- readRDS(file) # the object can be assigned to any name
args(readRDS) # for a single object stored in an .rds file
## function (file, refhook = NULL) 
## NULL
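To see how these readers relate, here is a small self-contained sketch. It writes a made-up three-line CSV to a temporary file (so it does not depend on view.csv) and reads it back three ways.

```r
# Write a tiny comma-separated file to a temporary location.
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("name,views", "Lecture 1,100", "Lecture 2,250"), tmp_csv)

# read.table() needs header and sep spelled out ...
d1 <- read.table(tmp_csv, header = TRUE, sep = ",")
# ... while read.csv() has them as defaults.
d2 <- read.csv(tmp_csv)
all.equal(d1, d2)

# readLines() returns the raw lines as a character vector.
raw_lines <- readLines(tmp_csv)
raw_lines[1]

unlink(tmp_csv)  # remove the temporary file
```

readLines() is handy for inspecting the first few lines of an unfamiliar file before choosing header, sep, and skip.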

Some useful functions

dir.exists()
dir.create()
file.exists()
file.create()
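A minimal sketch of how these helpers are typically combined: check first, create only if missing. The folder and file names are hypothetical, and everything is placed under tempdir() so nothing is written into your project.

```r
# Use a folder under tempdir() so the example does not touch the project.
out_dir <- file.path(tempdir(), "data_output_demo")  # hypothetical folder name

# Create the folder only if it is missing.
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Create an empty file only if it is missing.
out_file <- file.path(out_dir, "notes.txt")  # hypothetical file name
if (!file.exists(out_file)) {
  file.create(out_file)
}

dir.exists(out_dir)
file.exists(out_file)
```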

Non-base packages/functions

One of the most popular packages is readr, a core package of the tidyverse. From its webpage:

The goal of readr is to provide a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.

  • read_csv(): comma separated (CSV) files
  • read_tsv(): tab separated files
  • read_delim(): general delimited files
  • read_fwf(): fixed width files
  • read_table(): tabular files where columns are separated by white-space.
  • read_log(): web log files
library(readr)
read_csv(readr_example("mtcars.csv"))
## Rows: 32 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 32 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ℹ 22 more rows
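The message above suggests specifying the column types. A minimal sketch of doing so follows; it assumes readr 2.0 or later (where literal data can be wrapped in I()), and the data are made up.

```r
# Run only if readr is installed.
if (requireNamespace("readr", quietly = TRUE)) {
  csv_text <- "id,score\n1,9.5\n2,7.25"  # made-up data

  # Declaring column types up front silences the message and makes
  # the read fail loudly if the file changes shape.
  d <- readr::read_csv(
    I(csv_text),
    col_types = readr::cols(
      id = readr::col_integer(),
      score = readr::col_double()
    )
  )
  print(d)
}
```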

Other packages:

  • haven reads SPSS, Stata, and SAS files.
  • readxl reads Excel files (both .xls and .xlsx).
  • For hierarchical data: use jsonlite for JSON and xml2 for XML.

Large data? R reads data into memory, so if a dataset is too large to load, the DBI package, together with a database-specific backend (e.g., RMySQL, RSQLite, RPostgreSQL), lets you run SQL queries against a database and get a data frame back. If you are used to dplyr, another useful package is dbplyr, the database backend for dplyr. It lets you use remote database tables as if they were in-memory data frames by automatically converting dplyr code into SQL.
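A rough sketch of the DBI workflow, assuming the DBI and RSQLite packages are installed. An in-memory SQLite database filled with the built-in mtcars data stands in for a real, larger database.

```r
# Run only if both packages are installed.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # An in-memory SQLite database stands in for a real database server.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbWriteTable(con, "mtcars", mtcars)

  # Only the rows and columns selected by the query come back to R.
  heavy <- DBI::dbGetQuery(con, "SELECT mpg, cyl, wt FROM mtcars WHERE wt > 5")
  print(heavy)

  DBI::dbDisconnect(con)
}
```

The point of the pattern is that filtering happens in the database, so only the (small) result has to fit in memory.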

The fst package for R provides a fast, easy and flexible way to serialize data frames. With access speeds of multiple GB/s, fst is specifically designed to unlock the potential of high speed solid state disks that can be found in most modern computers. Data frames stored in the fst format have full random access, both in column and rows.

Also check out the vroom package, which describes itself as the fastest delimited reader for R (1.23 GB/sec).

Data output

After we finish cleaning the data, we generally want to save the cleaned data as external files so that we can use them directly next time.

If the data are relatively small, save them as plain-text files such as .csv. Otherwise, we can save them as compressed binary files, which are smaller but need specific tools (here, R) to open.

Base R functions

When we quit R or RStudio, the program normally asks whether we want to save the workspace. If we do, every object we created is saved into one file (.RData by default) in the working directory. The next time we start R or RStudio from that directory, this file is loaded automatically, so we have access to all the objects we created previously (recall load() in the data input section). This is convenient, but I don't recommend it because it can make your code non-reproducible. For example, if we created an object but did not save the code that made it, the object is still there the next time we use R; but if we share the code with others, they won't be able to run it.

In RStudio, I recommend setting "Save workspace to .RData on exit" to "Never".

It is better to save key objects/data as their own external files.

# writeClipboard(as.character(numeric.variables)) # go to Excel, Ctrl+V

# write.csv()
args(write.csv)
## function (...) 
## NULL
args(write.table)
## function (x, file = "", append = FALSE, quote = TRUE, sep = " ", 
##     eol = "\n", na = "NA", dec = ".", row.names = TRUE, col.names = TRUE, 
##     qmethod = c("escape", "double"), fileEncoding = "") 
## NULL
# save() # one or multiple objects
args(save)
## function (..., list = character(), file = stop("'file' must be specified"), 
##     ascii = FALSE, version = NULL, envir = parent.frame(), compress = isTRUE(!ascii), 
##     compression_level, eval.promises = TRUE, precheck = TRUE) 
## NULL
args(save.image) # save all objects
## function (file = ".RData", version = NULL, ascii = FALSE, compress = !ascii, 
##     safe = TRUE) 
## NULL
# saveRDS()
args(saveRDS) # to save one object
## function (object, file = "", ascii = FALSE, version = NULL, compress = TRUE, 
##     refhook = NULL) 
## NULL
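A minimal round-trip sketch of the two most common choices, write.csv() for plain text and saveRDS() for a single R object. The data frame and file names are made up, and files go to tempdir().

```r
out_dir <- tempdir()
scores <- data.frame(name = c("a", "b"), views = c(10L, 20L))  # made-up data

# Plain text: row.names = FALSE avoids writing an extra column of
# row numbers, which would come back as a column named X when re-read.
csv_file <- file.path(out_dir, "scores.csv")
write.csv(scores, csv_file, row.names = FALSE)
back_csv <- read.csv(csv_file)
all.equal(scores, back_csv)

# Binary: saveRDS() stores one object; readRDS() returns it, so it
# can be assigned to any name.
rds_file <- file.path(out_dir, "scores.rds")
saveRDS(scores, rds_file)
back_rds <- readRDS(rds_file)
identical(scores, back_rds)
```

The .rds route preserves column types exactly, whereas a .csv has to be re-parsed each time it is read.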

Naming things

Be sure to take a look at this excellent slide!

Non-base packages

Again, readr has corresponding write functions. They are an improvement on the analogous base R functions.

# write_csv()
args(write_csv)
## function (x, file, na = "NA", append = FALSE, col_names = !append, 
##     quote = c("needed", "all", "none"), escape = c("double", 
##         "backslash", "none"), eol = "\n", num_threads = readr_threads(), 
##     progress = show_progress(), path = deprecated(), quote_escape = deprecated()) 
## NULL
# write_rds()
args(write_rds)
## function (x, file, compress = c("none", "gz", "bz2", "xz"), version = 2, 
##     refhook = NULL, text = FALSE, path = deprecated(), ...) 
## NULL
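A short sketch of write_csv(), assuming readr is installed. One visible difference from write.csv() is that write_csv() never writes row names, so the header line contains only the column names.

```r
# Run only if readr is installed.
if (requireNamespace("readr", quietly = TRUE)) {
  tmp <- tempfile(fileext = ".csv")

  # Write the first three rows of the built-in mtcars data.
  readr::write_csv(head(mtcars, 3), tmp)

  # Inspect the header line of the file that was written.
  first_line <- readLines(tmp, n = 1)
  print(first_line)

  unlink(tmp)  # remove the temporary file
}
```

Because row names are dropped, car names in mtcars would need to be moved into a regular column first if they should be kept.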

Exercise

Download this file: https://figshare.com/ndownloader/files/17461766 to your computer. Then try to read it into R. You can try read.table, read.csv, read_csv, etc.

Then save it to your disk. Again, you can try write.csv or write_csv or write_rds.