class: center, middle, inverse, title-slide .title[ # Data science project cycle ] .subtitle[ ## Introduction to Data Science (BIOL7800)
https://introdatasci.dlilab.com/
] .author[ ### Daijiang Li ] .institute[ ### LSU ] .date[ ### 2023/08/24 ] --- # Data Science Processes .font200[ 1. Define the question of interest 2. Get the data 3. Clean and prepare the data 4. Explore the data 5. Fit models to extract insights 6. Tell, explain, and illustrate results ] --- # The OSEMN<sup>1</sup> framework .footnote[1: Pronounced as 'awesome'] ![OSEMN](https://miro.medium.com/max/1400/1*eE8DP4biqtaIK3aIy1S2zA.png) --- # The OSEMN framework .font200[ 1. Obtaining data 2. Scrubbing data 3. Exploring data 4. Modeling data 5. iNterpreting data ] .right[[Mason and Wiggins 2010](http://www.dataists.com/2010/09/a-taxonomy-of-data-science/)] --- # Obtaining data .font200[After defining your question, the first step is to obtain data] .pull-left[ ### Common sources .font150[ - Query data from a database or API (e.g., MySQL, Twitter, GBIF) - Download data from another location (e.g., a server, ftp) - Extract data from other files (e.g., html webpage, spreadsheet) - Generate your own data (e.g., simulation, experiment) ] ] .pull-right[ .font150[ ### Tools and skills - Relational database (e.g., SQLite, PostgreSQL, Spark), use API (e.g., R packages `dbplyr`, `DBI`) - Downloading data programmingly (web scrapping, `curl`, R packages `httr`, `rvest`) - Understanding of file system; decompress and manage files, etc. ] ] ??? API: application programming interface --- # Example ## Q: How many viewers does each video have in [this playlist](https://www.youtube.com/playlist?list=PLE7DDD91010BC51F8) (MIT Linear Algebra Spring 2005)? Do the view counts decline over time? ## Data provided by Youtube --- # Get lecture names and links <!-- debug: sel <- wdman::selenium(); sel$log() --> ```r library(rvest, warn.conflicts = FALSE) library(RSelenium) # to set up a server to run javascript; rs = RSelenium::rsDriver(browser = "firefox") rsc = rs$client rsc$navigate("https://www.youtube.com/playlist?list=PLE7DDD91010BC51F8") # now get the page source ht = rsc$getPageSource() url = rvest::read_html(ht[[1]]) lectures = html_elements(url, css = '#video-title') # show how to get this lec_names = html_text2(lectures) lec_links = html_attr(lectures, "href") lec_links_full = paste0("https://www.youtube.com", lec_links) ``` --- # Try to get view count of one link ```r # try one link # does not work url2 = read_html(lec_links_full[1]) x = html_elements(url2, css = "#info") # need this rsc$navigate(lec_links_full[1]) ht2 = rsc$getPageSource() ok2 <- rvest::read_html(ht2[[1]]) # show how to get this view = html_elements(ok2, css = ".ytd-video-view-count-renderer") view_count = html_text(view[1]) view_count as.numeric(gsub(",| views", "", view_count)) ``` --- # Convert it to a function ```r # put it as a function get_view = function(link){ rsc$navigate(link) url2 = rsc$getPageSource() Sys.sleep(1) url2 <- rvest::read_html(url2[[1]]) view = html_elements(url2, css = ".ytd-video-view-count-renderer") view_count = html_text(view[1]) view_count = as.integer(gsub(",| views", "", view_count)) return(view_count) } ``` --- # Get all view counts ```r # run it view_counts = data.frame(names = lec_names, views = NA_integer_) for(i in 1:length(lec_links_full)){ cat(lec_links_full[i], "\t") view_count = get_view(link = lec_links_full[i]) # for some reason, sometimes it takes multiple tries # while(length(view_count) == 0) # view_count = get_view(lec_links_full[i]) view_counts$views[i] = view_count } # save results write.csv(view_counts, "view.csv") rs$server$stop() # close the server ``` --- # Scrubbing (cleaning) data .font200[The world is a messy place] .pull-left[ ### Common operations .font150[ - Filtering errors - Replacing values (e.g., 9999) - Handling missing values and inconsistent labels - Parse into a useable format - .red[80% of your time?!] ] ] .pull-right[ .font150[ ### Tools and skills - `awk`, `sed`, `grep` - Data import & output (with R) - Data manipulation (with R) ] ] --- # Example continue ```r view_counts = read.csv("view.csv", row.names = 1) DT::datatable(view_counts, options = list(pageLength = 6)) ```
--- # Extract lecture numbers? ```r (a = stringr::str_extract(string = view_counts$names, pattern = "^[b0-9]*")) ``` ``` ## [1] "" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" ## [13] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" ## [25] "24" "24b" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" ``` ```r a[a == "24b"] = 24.5 a = as.numeric(a) view_counts$idx = a view_counts$names = stringr::str_replace(string = view_counts$names, pattern = "^[b0-9]*[.] ", replacement = "") head(view_counts$names) ``` ``` ## [1] "An Interview with Gilbert Strang on Teaching Linear Algebra" ## [2] "The Geometry of Linear Equations" ## [3] "Elimination with Matrices." ## [4] "Multiplication and Inverse Matrices" ## [5] "Factorization into A = LU" ## [6] "Transposes, Permutations, Spaces R^n" ``` --- # Data are ready? ```r DT::datatable(view_counts, options = list(pageLength = 6)) ```
--- # Exploring data .font200[Get to know your data better through visualization, clustering, dimensionality reducing, etc.] .pull-left[ ### Common inspections .font150[ - What are the different variables? - Their types, distributions, and range? - Relationships among them? Correlations? - Descriptive statistics? ] ] .pull-right[ .font150[ ### Tools and skills - `head`, `less`, `tail`, etc. - Data visualization (with R, `plot`, `lattice`, `ggplot2`) - Data description (with R, basic functions `mean`, `min`, `max`, etc.) ] ] --- # Example continue ```r plot(view_counts$idx, view_counts$views, type = "b", xlab = "Lecture number", ylab = "View counts") ``` <img src="presentation_files/figure-html/unnamed-chunk-8-1.png" width="80%" style="display: block; margin: auto;" /> --- # Example continue ```r plot(view_counts$idx, log10(view_counts$views), type = "b", xlab = "Lecture number", ylab = "View counts (log 10)") ``` <img src="presentation_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" /> --- # Modeling data .font200[All models are wrong, but some are useful.] .pull-left[ ### Common tasks .font150[ - To create an abstract or higher-level description of your data - To test hypotheses - To predict - With uncertainty ] ] .pull-right[ .font150[ ### Tools and skills - Dimension reducing, clustering, regression, classification - Statistical modeling (with R, `lm`, `glm`, `lmer`, etc.) - Machine learning (with R, random forest, deep learning, etc.) ] ] --- # Example continue ```r model_1 = lm(views ~ idx, data = view_counts) summary(model_1) ``` ``` ## ## Call: ## lm(formula = views ~ idx, data = view_counts) ## ## Residuals: ## Min 1Q Median 3Q Max ## -279979 -108083 -45405 43491 912530 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 788058 76886 10.250 8.70e-12 *** ## idx -24724 3806 -6.496 2.25e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 219300 on 33 degrees of freedom ## (1 observation deleted due to missingness) ## Multiple R-squared: 0.5612, Adjusted R-squared: 0.5479 ## F-statistic: 42.2 on 1 and 33 DF, p-value: 2.255e-07 ``` --- # Example continue ```r plot(view_counts$idx, view_counts$views, type = "b", xlab = "Lecture number", ylab = "View counts") abline(model_1) ``` <img src="presentation_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" /> --- # Example continue ```r model_2 = lm(log10(views) ~ idx, data = view_counts) summary(model_2) ``` ``` ## ## Call: ## lm(formula = log10(views) ~ idx, data = view_counts) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.89423 -0.12116 0.04193 0.17757 0.50399 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.959777 0.100523 59.287 < 2e-16 *** ## idx -0.033306 0.004976 -6.694 1.27e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2867 on 33 degrees of freedom ## (1 observation deleted due to missingness) ## Multiple R-squared: 0.5759, Adjusted R-squared: 0.563 ## F-statistic: 44.81 on 1 and 33 DF, p-value: 1.271e-07 ``` --- # Example continue ```r plot(view_counts$idx, log10(view_counts$views), type = "b", xlab = "Lecture number", ylab = "View counts") abline(model_2) ``` <img src="presentation_files/figure-html/unnamed-chunk-13-1.png" width="80%" style="display: block; margin: auto;" /> --- # iNterpreting data .font200[The purpose is to gain insights from numbers] .pull-left[ ### Common tasks .font150[ - What have we learned? - What should we do next? - Disseminate results and communicate with others - Produce useful products ] ] .pull-right[ .font150[ ### Tools and skills - Domain expertises and intuition - Being skeptical (double check) - Communication skills (presentation, writing) - Reproducible reports (with Rmarkdown and other tools) ] ] --- # Example continue ## ?? --- class: center, middle, inverse # Doing data science is an .red[iterative and non-linear] process!