Data science project cycle

Introduction to Data Science (BIOL7800)

https://introdatasci.dlilab.com/

Daijiang Li

LSU

2023/08/24

Data Science Processes

  1. Define the question of interest

  2. Get the data

  3. Clean and prepare the data

  4. Explore the data

  5. Fit models to extract insights

  6. Tell, explain, and illustrate results

The OSEMN framework

OSEMN is pronounced 'awesome'.

  1. Obtaining data

  2. Scrubbing data

  3. Exploring data

  4. Modeling data

  5. iNterpreting data

Mason and Wiggins 2010

Obtaining data

After defining your question, the first step is to obtain data

Common sources

  • Query data from a database or API (e.g., MySQL, Twitter, GBIF)

  • Download data from another location (e.g., a server, ftp)

  • Extract data from other files (e.g., html webpage, spreadsheet)

  • Generate your own data (e.g., simulation, experiment)

Tools and skills

  • Relational databases and query engines (e.g., SQLite, PostgreSQL, Spark) and their interfaces (e.g., R packages DBI, dbplyr)

  • Downloading data programmatically (web scraping, curl, R packages httr, rvest)

  • Understanding the file system: decompressing and managing files, etc.
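
A minimal sketch of the file-handling skills listed above, using only base R. The directory, file name, and toy values are hypothetical; real data would usually arrive as a downloaded archive instead of being written locally first.

```r
# Hypothetical example: create a working directory, store a compressed CSV,
# list what is there, and read the file back in.
tmp_dir = tempfile("raw_data_")
dir.create(tmp_dir)
gz_path = file.path(tmp_dir, "toy.csv.gz")

toy = data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
con = gzfile(gz_path, "w")          # open a gzip-compressed connection for writing
write.csv(toy, con, row.names = FALSE)
close(con)

list.files(tmp_dir)                 # check that the file exists
dat = read.csv(gzfile(gz_path))     # decompress and parse in one step
dat
```

Working inside a temporary directory keeps the sketch from touching any real project files.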

API: application programming interface

Example

Q: How many views does each video in this playlist (MIT Linear Algebra, Spring 2005) have? Do the view counts decline over time?

Data provided by YouTube

Get lecture names and links

library(rvest, warn.conflicts = FALSE)
library(RSelenium)
# start a Selenium server and browser, because the page is rendered by JavaScript
rs = RSelenium::rsDriver(browser = "firefox")
rsc = rs$client
rsc$navigate("https://www.youtube.com/playlist?list=PLE7DDD91010BC51F8")
# now get the rendered page source
ht = rsc$getPageSource()
page = rvest::read_html(ht[[1]])
# '#video-title' selects the title links (find it with the browser's inspector)
lectures = html_elements(page, css = '#video-title')
lec_names = html_text2(lectures)
lec_links = html_attr(lectures, "href")
# prepend the site address to build full links
lec_links_full = paste0("https://www.youtube.com", lec_links)

Try to get view count of one link

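# try one link
# reading the page directly does not work, because JavaScript is not run
url2 = read_html(lec_links_full[1])
x = html_elements(url2, css = "#info")
# instead, load the page in the Selenium browser and read the rendered source
rsc$navigate(lec_links_full[1])
ht2 = rsc$getPageSource()
ok2 <- rvest::read_html(ht2[[1]])
# this selector can be found with the browser's inspector
view = html_elements(ok2, css = ".ytd-video-view-count-renderer")
view_count = html_text(view[1])
view_count
# strip commas and the " views" suffix, then convert to a number
as.numeric(gsub(",| views", "", view_count))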
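
The cleaning step on the last line can be made a little more tolerant. The sketch below uses `parse_views()`, a hypothetical helper name that is not part of the original slides, to keep only the digits of a label. It assumes labels look like "1,234,567 views"; abbreviated labels such as "1.2M views" would be parsed incorrectly and need separate handling.

```r
# Hypothetical helper: turn view-count labels into integers
parse_views = function(x) {
  digits = gsub("[^0-9]", "", x)   # drop commas, spaces, and words
  as.integer(digits)               # an empty string becomes NA
}

labels = c("1,234,567 views", "987 views", "No views")
parse_views(labels)
```

Returning NA for labels without digits makes failed scrapes easy to spot later.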

Convert it to a function

# wrap the steps above in a function
get_view = function(link){
  rsc$navigate(link)
  Sys.sleep(1) # give the page time to render before reading its source
  url2 = rsc$getPageSource()
  url2 <- rvest::read_html(url2[[1]])
  view = html_elements(url2, css = ".ytd-video-view-count-renderer")
  view_count = html_text(view[1])
  view_count = as.integer(gsub(",| views", "", view_count))
  return(view_count)
}

Get all view counts

# run it
view_counts = data.frame(names = lec_names, views = NA_integer_)
for(i in seq_along(lec_links_full)){
  cat(lec_links_full[i], "\t")
  view_count = get_view(link = lec_links_full[i])
  # for some reason, sometimes it takes multiple tries
  # while(length(view_count) == 0)
  #   view_count = get_view(lec_links_full[i])
  # keep NA when no view count was found
  if(length(view_count) == 1) view_counts$views[i] = view_count
}
# save results
write.csv(view_counts, "view.csv")
rs$server$stop() # close the server

Scrubbing (cleaning) data

The world is a messy place

Common operations

  • Filtering errors

  • Replacing placeholder values (e.g., 9999)

  • Handling missing values and inconsistent labels

  • Parsing data into a usable format

  • 80% of your time?!

Tools and skills

  • awk, sed, grep

  • Data import & output (with R)

  • Data manipulation (with R)
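
The common operations above can be sketched in base R as follows. The data frame, its column names, and the rule that a mass must be positive are all hypothetical, chosen only to illustrate each operation.

```r
# Hypothetical toy data: 9999 marks a missing measurement,
# site labels are inconsistent, and one mass is impossible (negative)
raw = data.frame(
  site = c("A", "a ", "B", "b", "B"),
  mass = c(12.1, 9999, 15.3, -4, 14.8)
)

clean = raw
clean$site = toupper(trimws(clean$site))             # harmonize labels
clean$mass[clean$mass == 9999] = NA                  # placeholder to NA
clean = clean[is.na(clean$mass) | clean$mass > 0, ]  # filter impossible values
clean
```

Missing values are kept as NA here instead of being dropped, so the decision about how to handle them stays explicit.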

Example, continued

view_counts = read.csv("view.csv", row.names = 1)
DT::datatable(view_counts, options = list(pageLength = 6))

Extract lecture numbers?

(a = stringr::str_extract(string = view_counts$names,
pattern = "^[b0-9]*"))
## [1] "" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
## [13] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23"
## [25] "24" "24b" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34"
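a[a == "24b"] = "24.5" # place lecture 24b between 24 and 25
a = as.numeric(a)      # the interview has no number, so it becomes NA
view_counts$idx = a
# remove the leading lecture number from the names
view_counts$names = stringr::str_replace(string = view_counts$names,
                                         pattern = "^[b0-9]*[.] ",
                                         replacement = "")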
head(view_counts$names)
## [1] "An Interview with Gilbert Strang on Teaching Linear Algebra"
## [2] "The Geometry of Linear Equations"
## [3] "Elimination with Matrices."
## [4] "Multiplication and Inverse Matrices"
## [5] "Factorization into A = LU"
## [6] "Transposes, Permutations, Spaces R^n"

Data are ready?

DT::datatable(view_counts, options = list(pageLength = 6))

Exploring data

Get to know your data better through visualization, clustering, dimensionality reduction, etc.

Common inspections

  • What are the different variables?

  • Their types, distributions, and ranges?

  • Relationships among them? Correlations?

  • Descriptive statistics?

Tools and skills

  • head, less, tail, etc.

  • Data visualization (with R, plot, lattice, ggplot2)

  • Data description (with R, basic functions mean, min, max, etc.)
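
A minimal sketch of these inspections in base R. The data frame `dat` is simulated here as a hypothetical stand-in (it is not the scraped playlist data), so the numbers themselves carry no meaning.

```r
# Hypothetical data: view counts that decline with lecture number, plus noise
set.seed(1)
dat = data.frame(idx = 1:30)
dat$views = round(1e6 * exp(-0.08 * dat$idx) * exp(rnorm(30, sd = 0.2)))

str(dat)                                 # variables and their types
summary(dat$views)                       # descriptive statistics
rng = range(dat$views)                   # smallest and largest value
r = cor(dat$idx, log10(dat$views))       # relationship between the two variables
hist(log10(dat$views), main = "", xlab = "View counts (log 10)")  # distribution
```

Looking at the distribution on a log scale is one option when values span several orders of magnitude.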

Example, continued

plot(view_counts$idx, view_counts$views, type = "b",
xlab = "Lecture number", ylab = "View counts")

Example, continued

plot(view_counts$idx, log10(view_counts$views), type = "b",
xlab = "Lecture number", ylab = "View counts (log 10)")

Modeling data

All models are wrong, but some are useful.

Common tasks

  • To create an abstract or higher-level description of your data

  • To test hypotheses

  • To predict

  • With uncertainty

Tools and skills

  • Dimension reduction, clustering, regression, classification

  • Statistical modeling (with R, lm, glm, lmer, etc.)

  • Machine learning (with R, random forest, deep learning, etc.)
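
Prediction with uncertainty can be sketched with `lm()` as below. The data are simulated and the variable names are hypothetical; the point is only to show where interval estimates come from.

```r
# Hypothetical data: a linear trend on the log10 scale, plus noise
set.seed(42)
dat = data.frame(idx = 1:35)
dat$log_views = 6 - 0.03 * dat$idx + rnorm(35, sd = 0.3)

fit = lm(log_views ~ idx, data = dat)

ci = confint(fit)                        # 95% confidence intervals for coefficients
new_lectures = data.frame(idx = c(36, 40))
pred = predict(fit, newdata = new_lectures,
               interval = "prediction")  # point predictions with 95% intervals
ci
pred
```

Confidence intervals describe uncertainty in the fitted coefficients, while prediction intervals also include the scatter of individual observations, so they are wider.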

Example, continued

model_1 = lm(views ~ idx, data = view_counts)
summary(model_1)
##
## Call:
## lm(formula = views ~ idx, data = view_counts)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -279979 -108083  -45405   43491  912530
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   788058      76886  10.250 8.70e-12 ***
## idx           -24724       3806  -6.496 2.25e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219300 on 33 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5612, Adjusted R-squared:  0.5479
## F-statistic:  42.2 on 1 and 33 DF,  p-value: 2.255e-07

Example, continued

plot(view_counts$idx, view_counts$views, type = "b",
xlab = "Lecture number", ylab = "View counts")
abline(model_1)

Example, continued

model_2 = lm(log10(views) ~ idx, data = view_counts)
summary(model_2)
##
## Call:
## lm(formula = log10(views) ~ idx, data = view_counts)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.89423 -0.12116  0.04193  0.17757  0.50399
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.959777   0.100523  59.287  < 2e-16 ***
## idx         -0.033306   0.004976  -6.694 1.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2867 on 33 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5759, Adjusted R-squared:  0.563
## F-statistic: 44.81 on 1 and 33 DF,  p-value: 1.271e-07

Example, continued

plot(view_counts$idx, log10(view_counts$views), type = "b",
xlab = "Lecture number", ylab = "View counts (log 10)")
abline(model_2)
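
Because model_2 is fitted on the log10 scale, its slope is easier to read after back-transforming. The sketch below takes the idx coefficient reported by summary(model_2) and converts it into a multiplicative change per lecture; the variable names are hypothetical.

```r
slope = -0.033306                 # idx coefficient from summary(model_2)
ratio = 10^slope                  # expected views are multiplied by this per lecture
pct_decline = (1 - ratio) * 100   # the same change as a percentage drop
round(c(ratio = ratio, pct_decline = pct_decline), 3)
```

Under this model, each additional lecture is associated with roughly 7% fewer views than the one before it.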

iNterpreting data

The purpose is to gain insights from numbers

Common tasks

  • What have we learned?

  • What should we do next?

  • Disseminate results and communicate with others

  • Produce useful products

Tools and skills

  • Domain expertise and intuition

  • Being skeptical (double check)

  • Communication skills (presentation, writing)

  • Reproducible reports (with R Markdown and other tools)

Example, continued

??

Doing data science is an iterative and non-linear process!
