Define the question of interest
Get the data
Clean and prepare the data
Explore the data
Fit models to extract insights
Tell, explain, and illustrate results
1: Pronounced as 'awesome'
Obtaining data
Scrubbing data
Exploring data
Modeling data
iNterpreting data
After defining your question, the first step is to obtain data
Query data from a database or API (e.g., MySQL, Twitter, GBIF)
Download data from another location (e.g., a server, ftp)
Extract data from other files (e.g., html webpage, spreadsheet)
Generate your own data (e.g., simulation, experiment)
Relational databases (e.g., SQLite, PostgreSQL, Spark) and their APIs (e.g., R packages dbplyr, DBI)
Downloading data programmatically (web scraping, curl, R packages httr, rvest)
Understanding of file system; decompress and manage files, etc.
API: application programming interface
library(rvest, warn.conflicts = FALSE)
library(RSelenium)
# to set up a server to run javascript;
rs = RSelenium::rsDriver(browser = "firefox")
rsc = rs$client
rsc$navigate("https://www.youtube.com/playlist?list=PLE7DDD91010BC51F8")
# now get the page source
ht = rsc$getPageSource()
url = rvest::read_html(ht[[1]])
lectures = html_elements(url, css = '#video-title') # show how to get this
lec_names = html_text2(lectures)
lec_links = html_attr(lectures, "href")
lec_links_full = paste0("https://www.youtube.com", lec_links)
# try one link
# does not work
url2 = read_html(lec_links_full[1])
x = html_elements(url2, css = "#info")
# need this
rsc$navigate(lec_links_full[1])
ht2 = rsc$getPageSource()
ok2 <- rvest::read_html(ht2[[1]])
# show how to get this
view = html_elements(ok2, css = ".ytd-video-view-count-renderer")
view_count = html_text(view[1])
view_count
as.numeric(gsub(",| views", "", view_count))
# put it as a function
get_view = function(link){
  rsc$navigate(link)
  Sys.sleep(1) # give the JavaScript-rendered page time to load before reading it
  url2 = rsc$getPageSource()
  url2 <- rvest::read_html(url2[[1]])
  view = html_elements(url2, css = ".ytd-video-view-count-renderer")
  view_count = html_text(view[1])
  # strip thousands separators and the " views" suffix, then convert
  view_count = as.integer(gsub(",| views", "", view_count))
  return(view_count)
}
# run it
view_counts = data.frame(names = lec_names, views = NA_integer_)
for(i in 1:length(lec_links_full)){
  cat(lec_links_full[i], "\t")
  view_count = get_view(link = lec_links_full[i])
  # for some reason, sometimes it takes multiple tries
  # while(length(view_count) == 0)
  #   view_count = get_view(lec_links_full[i])
  view_counts$views[i] = view_count
}
# save results
write.csv(view_counts, "view.csv")
rs$server$stop() # close the server
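The commented-out `while` loop above would retry forever if a page never yields a view count. A minimal sketch of a bounded retry follows. The names `retry_until_value`, `max_tries`, `wait`, and `flaky_get_view` are hypothetical and not part of the original script; a stand-in function replaces the live browser call so the example runs without Selenium.

```r
# Hypothetical helper: call f(...) until it returns a non-empty, non-NA
# result, giving up after `max_tries` attempts.
retry_until_value = function(f, ..., max_tries = 5, wait = 1){
  for(attempt in seq_len(max_tries)){
    result = f(...)
    if(length(result) > 0 && !is.na(result[1])) return(result)
    Sys.sleep(wait) # pause before trying again
  }
  NA_integer_ # signal failure instead of looping forever
}

# Stand-in for get_view(): returns nothing on the first two calls,
# then a made-up count, to mimic a page that loads slowly.
calls = 0
flaky_get_view = function(link){
  calls <<- calls + 1
  if(calls < 3) integer(0) else 12345L
}

demo_count = retry_until_value(flaky_get_view, link = "https://example.com", wait = 0)
demo_count
```

In the scraping loop, `retry_until_value(get_view, link = lec_links_full[i])` could take the place of the direct `get_view()` call; a lecture that still fails after all tries is then stored as `NA` rather than stopping the loop.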
The world is a messy place
Filtering errors
Replacing values (e.g., 9999)
Handling missing values and inconsistent labels
Parse into a useable format
80% of your time?!
awk, sed, grep
Data import & output (with R)
Data manipulation (with R)
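The scrubbing steps listed above can be sketched in base R. The data frame `raw` and its columns are invented for illustration only; the sentinel value 9999 follows the example mentioned in the list.

```r
# Toy data, invented for illustration only
raw = data.frame(
  site  = c("A", "a ", "B", "b", "B"),
  temp  = c(21.5, 9999, 19.0, NA, 22.3),
  count = c("10", "12", "n/a", "7", "-3"),
  stringsAsFactors = FALSE
)

clean = raw
# Replace the sentinel value 9999 with a proper missing value
clean$temp[which(clean$temp == 9999)] = NA
# Harmonize inconsistent labels: trim whitespace, use upper case
clean$site = toupper(trimws(clean$site))
# Parse text into numbers; unparseable entries such as "n/a" become NA
clean$count = suppressWarnings(as.integer(clean$count))
# Filter errors: a count cannot be negative (rows with NA counts are kept)
clean = clean[is.na(clean$count) | clean$count >= 0, ]
# Count the remaining missing values per column
colSums(is.na(clean))
```

Whether a missing value should be dropped, kept, or imputed depends on the question being asked; this sketch only marks such values consistently as `NA`.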
view_counts = read.csv("view.csv", row.names = 1)
DT::datatable(view_counts, options = list(pageLength = 6))
| | names | views |
|---|---|---|
| 1 | An Interview with Gilbert Strang on Teaching Linear Algebra | 531657 |
| 2 | 1. The Geometry of Linear Equations | 749756 |
| 3 | 2. Elimination with Matrices. | 1651140 |
| 4 | 3. Multiplication and Inverse Matrices | 1149974 |
| 5 | 4. Factorization into A = LU | 431759 |
| 6 | 5. Transposes, Permutations, Spaces R^n | 669672 |
(a = stringr::str_extract(string = view_counts$names, pattern = "^[b0-9]*"))
##  [1] ""    "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"
## [13] "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"
## [25] "24"  "24b" "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"
a[a == "24b"] = 24.5
a = as.numeric(a)
view_counts$idx = a
view_counts$names = stringr::str_replace(string = view_counts$names, pattern = "^[b0-9]*[.] ", replacement = "")
head(view_counts$names)
## [1] "An Interview with Gilbert Strang on Teaching Linear Algebra"
## [2] "The Geometry of Linear Equations"
## [3] "Elimination with Matrices."
## [4] "Multiplication and Inverse Matrices"
## [5] "Factorization into A = LU"
## [6] "Transposes, Permutations, Spaces R^n"
DT::datatable(view_counts, options = list(pageLength = 6))
| | names | views | idx |
|---|---|---|---|
| 1 | An Interview with Gilbert Strang on Teaching Linear Algebra | 531657 | |
| 2 | The Geometry of Linear Equations | 749756 | 1 |
| 3 | Elimination with Matrices. | 1651140 | 2 |
| 4 | Multiplication and Inverse Matrices | 1149974 | 3 |
| 5 | Factorization into A = LU | 431759 | 4 |
| 6 | Transposes, Permutations, Spaces R^n | 669672 | 5 |
Get to know your data better through visualization, clustering, dimensionality reduction, etc.
What are the different variables?
Their types, distributions, and range?
Relationships among them? Correlations?
Descriptive statistics?
head, less, tail, etc.
Data visualization (with R, plot, lattice, ggplot2)
Data description (with R, basic functions mean, min, max, etc.)
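As a small illustration of the exploration questions above, the sketch below applies a few base R functions to the six rows of `view_counts` shown in the earlier table. The data frame name `vc` is introduced here for the example only.

```r
# First six rows of view_counts, copied from the table shown earlier
vc = data.frame(
  idx   = c(NA, 1, 2, 3, 4, 5),
  views = c(531657, 749756, 1651140, 1149974, 431759, 669672)
)

str(vc)               # variable types
summary(vc$views)     # quartiles, mean, minimum, maximum
range(vc$views)       # smallest and largest view counts
sum(is.na(vc$idx))    # number of missing lecture numbers
# Correlation between lecture number and views, skipping the missing idx
cor(vc$idx, vc$views, use = "complete.obs")
```

With only six rows these numbers say little about the full playlist; the point is the kind of question each function answers.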
plot(view_counts$idx, view_counts$views, type = "b", xlab = "Lecture number", ylab = "View counts")
plot(view_counts$idx, log10(view_counts$views), type = "b", xlab = "Lecture number", ylab = "View counts (log 10)")
All models are wrong, but some are useful.
To create an abstract or higher-level description of your data
To test hypotheses
To predict
With uncertainty
Dimension reduction, clustering, regression, classification
Statistical modeling (with R, lm, glm, lmer, etc.)
Machine learning (with R, random forest, deep learning, etc.)
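Predicting "with uncertainty" can be sketched with `lm()` and `predict()`. The data below are simulated, not the lecture data, and the names `sim`, `fit`, `new_x`, and `pred` are chosen for this example only.

```r
set.seed(1) # make the simulation reproducible
# Simulated data: y depends linearly on x, plus random noise
sim = data.frame(x = 1:30)
sim$y = 5 - 0.1 * sim$x + rnorm(30, sd = 0.5)

fit = lm(y ~ x, data = sim)
confint(fit) # 95% confidence intervals for the coefficients

# Predict at new x values, with 95% prediction intervals
new_x = data.frame(x = c(31, 35))
pred = predict(fit, newdata = new_x, interval = "prediction", level = 0.95)
pred # columns: fit (point prediction), lwr, upr
```

The `lwr` and `upr` columns bound where a new observation is expected to fall, which is wider than the confidence interval for the fitted mean.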
model_1 = lm(views ~ idx, data = view_counts)
summary(model_1)
## 
## Call:
## lm(formula = views ~ idx, data = view_counts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -279979 -108083  -45405   43491  912530 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   788058      76886  10.250 8.70e-12 ***
## idx           -24724       3806  -6.496 2.25e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219300 on 33 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5612, Adjusted R-squared:  0.5479 
## F-statistic:  42.2 on 1 and 33 DF,  p-value: 2.255e-07
plot(view_counts$idx, view_counts$views, type = "b", xlab = "Lecture number", ylab = "View counts")
abline(model_1)
model_2 = lm(log10(views) ~ idx, data = view_counts)
summary(model_2)
## 
## Call:
## lm(formula = log10(views) ~ idx, data = view_counts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.89423 -0.12116  0.04193  0.17757  0.50399 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.959777   0.100523  59.287  < 2e-16 ***
## idx         -0.033306   0.004976  -6.694 1.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2867 on 33 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5759, Adjusted R-squared:  0.563 
## F-statistic: 44.81 on 1 and 33 DF,  p-value: 1.271e-07
plot(view_counts$idx, log10(view_counts$views), type = "b", xlab = "Lecture number", ylab = "View counts (log 10)")
abline(model_2)
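Because model_2 is fitted on the log10 scale, its slope is easier to read after back-transforming. The short calculation below uses the estimates printed in the model_2 summary; the variable names are chosen for this example only.

```r
# Estimates copied from the model_2 summary above
intercept = 5.959777
slope = -0.033306

# Multiplicative change in views per additional lecture
ratio = 10^slope
ratio
# Percent change in views per additional lecture
pct_change = (ratio - 1) * 100
pct_change
# Fitted views for the first lecture (idx = 1), back on the original scale
10^(intercept + slope * 1)
```

Under this model each successive lecture gets roughly 7% fewer views than the one before it, which is a more natural statement than a fixed drop in views per lecture.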
The purpose is to gain insights from numbers
What have we learned?
What should we do next?
Disseminate results and communicate with others
Produce useful products
Domain expertise and intuition
Being skeptical (double check)
Communication skills (presentation, writing)
Reproducible reports (with Rmarkdown and other tools)
Define the question of interest
Get the data
Clean and prepare the data
Explore the data
Fit models to extract insights
Tell, explain, and illustrate results