Data science project cycle

Introduction to Data Science (BIOL7800)

https://introdatasci.dlilab.com/

Daijiang Li

LSU

2023/08/24

Data Science Processes

Define the question of interest
Get the data
Clean and prepare the data
Explore the data
Fit models to extract insights
Tell, explain, and illustrate results

The OSEMN¹ framework

1: Pronounced as 'awesome'

The OSEMN framework

Obtaining data
Scrubbing data
Exploring data
Modeling data
iNterpreting data

Mason and Wiggins 2010

Obtaining data

After defining your question, the first step is to obtain data

Common sources

Query data from a database or API (e.g., MySQL, Twitter, GBIF)
Download data from another location (e.g., a server, FTP)
Extract data from other files (e.g., HTML web page, spreadsheet)
Generate your own data (e.g., simulation, experiment)

Tools and skills

Relational databases and query engines (e.g., SQLite, PostgreSQL, Spark) and their programming interfaces (e.g., R packages dbplyr, DBI)
Downloading data programmatically (web scraping, curl, R packages httr, rvest)
Understanding the file system: decompressing and managing files, etc.
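
As a minimal sketch of querying a relational database from R, the following uses DBI with a throwaway in-memory SQLite database. The table name `lectures` and its values are made up for illustration, and the RSQLite package is assumed to be installed.

```r
library(DBI)

# connect to an in-memory SQLite database (needs the RSQLite package)
con = dbConnect(RSQLite::SQLite(), ":memory:")

# hypothetical table, invented for this example
toy = data.frame(lecture = c("A", "B", "C"), views = c(120L, 80L, 45L))
dbWriteTable(con, "lectures", toy)

# send SQL to the database and get a data frame back
popular = dbGetQuery(
  con,
  "SELECT lecture, views FROM lectures WHERE views > 50 ORDER BY views DESC"
)
popular

# always release the connection when done
dbDisconnect(con)
```

The same `dbGetQuery()` call should work against other DBI backends once `dbConnect()` is pointed at them.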

API: application programming interface

Example

Q: How many views does each video have in this playlist (MIT Linear Algebra Spring 2005)? Do the view counts decline over time?

Data provided by YouTube

Get lecture names and links

library(rvest, warn.conflicts = FALSE)
library(RSelenium)
# start a Selenium server and browser so that the page's JavaScript is run
rs = RSelenium::rsDriver(browser = "firefox")
rsc = rs$client
rsc$navigate("https://www.youtube.com/playlist?list=PLE7DDD91010BC51F8")
# now get the rendered page source
ht = rsc$getPageSource()
page = rvest::read_html(ht[[1]])
# the selector '#video-title' can be found by inspecting the page in the browser
lectures = html_elements(page, css = '#video-title')
lec_names = html_text2(lectures)
lec_links = html_attr(lectures, "href")
lec_links_full = paste0("https://www.youtube.com", lec_links)
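
The rvest calls used here can be tried without a browser or a network connection. The sketch below applies them to a tiny hand-written page; the HTML, the `video-title` class, and the `*_demo` names are made up for illustration and are not YouTube's real markup.

```r
library(rvest)

# a tiny hand-written page standing in for a downloaded one (made-up content)
page_demo = minimal_html('
  <a class="video-title" href="/watch?v=aaa">1. First lecture</a>
  <a class="video-title" href="/watch?v=bbb">2. Second lecture</a>
')

# select every link with the class, then pull out its text and href attribute
links_demo = html_elements(page_demo, css = "a.video-title")
names_demo = html_text2(links_demo)
hrefs_demo = html_attr(links_demo, "href")

data.frame(name = names_demo,
           link = paste0("https://www.youtube.com", hrefs_demo))
```

`html_text2()` returns the visible text of each element and `html_attr()` returns one attribute per element, so the two vectors line up row by row.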

Try to get view count of one link

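# try one link with a plain HTTP request
# does not work, presumably because the view count is filled in by JavaScript
url2 = read_html(lec_links_full[1])
x = html_elements(url2, css = "#info")
# need this: load the page in the Selenium-driven browser instead
rsc$navigate(lec_links_full[1])
ht2 = rsc$getPageSource()
ok2 = rvest::read_html(ht2[[1]])
# the selector can be found by inspecting the page in the browser
view = html_elements(ok2, css = ".ytd-video-view-count-renderer")
view_count = html_text(view[1])
view_count
# strip the thousands separators and the " views" suffix, then convert
as.numeric(gsub(",| views", "", view_count))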
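
The string-to-number step can be checked in isolation. This is a minimal sketch with a hypothetical helper, `parse_views()`, which drops every non-digit character instead of matching the exact " views" suffix; it assumes counts are written out in full (it would misread abbreviated forms such as "1.6M views").

```r
# hypothetical helper: keep only the digits, then convert to a number
parse_views = function(x) {
  as.numeric(gsub("[^0-9]", "", x))
}

parse_views(c("1,651,140 views", "431,759 views"))
# → 1651140  431759
```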

Convert it to a function

# put it as a function
get_view = function(link){
  rsc$navigate(link)
  Sys.sleep(1) # give the page time to render before reading its source
  url2 = rsc$getPageSource()
  url2 = rvest::read_html(url2[[1]])
  view = html_elements(url2, css = ".ytd-video-view-count-renderer")
  view_count = html_text(view[1])
  view_count = as.integer(gsub(",| views", "", view_count))
  return(view_count)
}

Get all view counts

# run it
view_counts = data.frame(names = lec_names, views = NA_integer_)
for(i in seq_along(lec_links_full)){
  cat(lec_links_full[i], "\t")
  view_count = get_view(link = lec_links_full[i])
  # sometimes nothing is returned (the page may not have finished rendering);
  # if so, uncomment the next two lines to retry until a count comes back
  # while(length(view_count) == 0)
  #   view_count = get_view(lec_links_full[i])
  view_counts$views[i] = view_count
}
# save results
write.csv(view_counts, "view.csv")
rs$server$stop() # close the server
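
The commented-out `while` loop would retry forever if a page never yields a count. One possible alternative is a bounded retry, sketched below. The names `retry()` and `flaky_view()` are hypothetical; `flaky_view()` is only a stand-in that imitates a scraper returning nothing on its first two calls.

```r
# call f(...) until it returns a non-empty, non-missing result,
# giving up after max_tries attempts
retry = function(f, ..., max_tries = 5, wait = 0) {
  for (attempt in seq_len(max_tries)) {
    result = f(...)
    if (length(result) > 0 && !anyNA(result)) return(result)
    Sys.sleep(wait)  # pause before the next attempt
  }
  NA_integer_  # placeholder when every attempt failed
}

# stand-in for a scraper: empty result on the first two calls, then a count
calls = 0
flaky_view = function(link) {
  calls <<- calls + 1
  if (calls < 3) integer(0) else 12345L
}

retry(flaky_view, "https://example.com/video")
```

Returning `NA_integer_` after the last attempt keeps the loop over all links running, and the missing counts can be inspected afterwards.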

Scrubbing (cleaning) data

The world is a messy place

Common operations

Filtering errors
Replacing values (e.g., 9999)
Handling missing values and inconsistent labels
Parsing into a usable format
80% of your time?!

Tools and skills

awk, sed, grep
Data import & output (with R)
Data manipulation (with R)
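
The common operations listed above can be sketched in a few lines of base R. The data frame below is made up for illustration: 9999 stands for a missing-value code, -5 for a recording error, and the site labels are deliberately inconsistent.

```r
# made-up raw measurements
raw = data.frame(
  site   = c("Site A", "site a", "SITE B ", "Site B", "Site A"),
  height = c(12.1, 9999, 8.4, -5, 10.3)
)

clean = raw
# replace the sentinel value with a proper missing value
clean$height[clean$height == 9999] = NA
# harmonize inconsistent labels: trim whitespace and use one letter case
clean$site = toupper(trimws(clean$site))
# filter out impossible values (negative heights) while keeping the NAs
clean = clean[is.na(clean$height) | clean$height >= 0, ]
clean
```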

Example continued

view_counts = read.csv("view.csv", row.names = 1)
DT::datatable(view_counts, options = list(pageLength = 6))

First 6 of 36 entries:

	names	views
1	An Interview with Gilbert Strang on Teaching Linear Algebra	531657
2	1. The Geometry of Linear Equations	749756
3	2. Elimination with Matrices.	1651140
4	3. Multiplication and Inverse Matrices	1149974
5	4. Factorization into A = LU	431759
6	5. Transposes, Permutations, Spaces R^n	669672

Extract lecture numbers?

(a = stringr::str_extract(string = view_counts$names,
                          pattern = "^[b0-9]*"))

##  [1] ""    "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11" 
## [13] "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23" 
## [25] "24"  "24b" "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"

a[a == "24b"] = 24.5
a = as.numeric(a)
view_counts$idx = a
view_counts$names = stringr::str_replace(string = view_counts$names, 
                                         pattern = "^[b0-9]*[.] ", 
                                         replacement = "")
head(view_counts$names)

## [1] "An Interview with Gilbert Strang on Teaching Linear Algebra"
## [2] "The Geometry of Linear Equations"                           
## [3] "Elimination with Matrices."                                 
## [4] "Multiplication and Inverse Matrices"                        
## [5] "Factorization into A = LU"                                  
## [6] "Transposes, Permutations, Spaces R^n"
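
The special case for "24b" can also be handled by a rule instead of a hard-coded value. The sketch below uses a hypothetical helper, `lecture_index()`, and assumes that a trailing "b" always means "halfway to the next lecture"; the third title in the usage example is invented, since the real title of lecture 24b is not shown here.

```r
# hypothetical helper: numeric lecture index from a title such as "24b. Title"
lecture_index = function(names) {
  has_num = grepl("^[0-9]+b?[.] ", names)          # titles with a leading number
  num = sub("^([0-9]+)b?[.] .*$", "\\1", names)    # keep only the digits
  idx = rep(NA_real_, length(names))
  idx[has_num] = as.numeric(num[has_num])
  # a trailing "b" marks an extra session placed halfway to the next lecture
  is_b = grepl("^[0-9]+b[.] ", names)
  idx[is_b] = idx[is_b] + 0.5
  idx
}

lecture_index(c("An Interview with Gilbert Strang on Teaching Linear Algebra",
                "1. The Geometry of Linear Equations",
                "24b. A made-up title"))
# → NA  1.0 24.5
```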

Data are ready?

DT::datatable(view_counts, options = list(pageLength = 6))

First 6 of 36 entries:

	names	views	idx
1	An Interview with Gilbert Strang on Teaching Linear Algebra	531657	NA
2	The Geometry of Linear Equations	749756	1
3	Elimination with Matrices.	1651140	2
4	Multiplication and Inverse Matrices	1149974	3
5	Factorization into A = LU	431759	4
6	Transposes, Permutations, Spaces R^n	669672	5

Exploring data

Get to know your data better through visualization, clustering, dimensionality reduction, etc.

Common inspections

What are the different variables?
Their types, distributions, and range?
Relationships among them? Correlations?
Descriptive statistics?

Tools and skills

head, less, tail, etc.
Data visualization (with R, plot, lattice, ggplot2)
Data description (with R, basic functions mean, min, max, etc.)
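
The inspections listed above map onto a handful of base R calls. A minimal sketch using the built-in `mtcars` dataset (chosen only because it ships with R, not because it relates to the lecture example):

```r
data(mtcars)

# what are the variables and their types?
str(mtcars[, c("mpg", "wt", "cyl")])

# distribution and range of single variables
summary(mtcars$mpg)
range(mtcars$wt)
table(mtcars$cyl)   # counts per category

# relationship between two variables
r = cor(mtcars$mpg, mtcars$wt)
r
```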

Example continued

plot(view_counts$idx, view_counts$views, type = "b",
     xlab = "Lecture number", ylab = "View counts")

Example continued

plot(view_counts$idx, log10(view_counts$views), type = "b",
     xlab = "Lecture number", ylab = "View counts (log 10)")

Modeling data

All models are wrong, but some are useful.

Common tasks

To create an abstract or higher-level description of your data
To test hypotheses
To predict
With uncertainty

Tools and skills

Dimension reduction, clustering, regression, classification
Statistical modeling (with R, lm, glm, lmer, etc.)
Machine learning (with R, random forest, deep learning, etc.)
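
"To predict, with uncertainty" can be illustrated with `lm()` and `predict()`. The sketch below uses the built-in `cars` dataset (stopping distance against speed), picked only for illustration; the new speeds are arbitrary.

```r
# fit a simple linear regression on a built-in dataset
fit = lm(dist ~ speed, data = cars)

# predict for new values and ask for 95% prediction intervals
new_speeds = data.frame(speed = c(10, 20))
pred = predict(fit, newdata = new_speeds, interval = "prediction", level = 0.95)
pred   # columns: fit (point prediction), lwr and upr (interval bounds)
```

Using `interval = "confidence"` instead gives the narrower interval for the mean response, as opposed to a single new observation.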

Example continued

model_1 = lm(views ~ idx, data = view_counts)
summary(model_1)

## 
## Call:
## lm(formula = views ~ idx, data = view_counts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -279979 -108083  -45405   43491  912530 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   788058      76886  10.250 8.70e-12 ***
## idx           -24724       3806  -6.496 2.25e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219300 on 33 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5612,    Adjusted R-squared:  0.5479 
## F-statistic:  42.2 on 1 and 33 DF,  p-value: 2.255e-07

Example continued

plot(view_counts$idx, view_counts$views, type = "b",
     xlab = "Lecture number", ylab = "View counts")
abline(model_1)

Example continued

model_2 = lm(log10(views) ~ idx, data = view_counts)
summary(model_2)

## 
## Call:
## lm(formula = log10(views) ~ idx, data = view_counts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.89423 -0.12116  0.04193  0.17757  0.50399 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.959777   0.100523  59.287  < 2e-16 ***
## idx         -0.033306   0.004976  -6.694 1.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2867 on 33 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5759,    Adjusted R-squared:  0.563 
## F-statistic: 44.81 on 1 and 33 DF,  p-value: 1.271e-07
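
Because the response is on a log10 scale, the slope is easier to read after back-transforming: each additional lecture multiplies the fitted view count by 10^slope, which is about 0.926 here, or roughly 7% fewer views per lecture. A minimal sketch of that arithmetic, with the coefficients copied from the output above (the helper `predicted_views()` is hypothetical):

```r
# coefficients copied from summary(model_2)
intercept = 5.959777
slope = -0.033306

# multiplicative change in fitted views for each additional lecture
ratio = 10^slope
pct_decline = (1 - ratio) * 100
c(ratio = ratio, pct_decline = pct_decline)

# fitted views at a given lecture number, back on the original scale
predicted_views = function(idx) 10^(intercept + slope * idx)
predicted_views(c(1, 34))
```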

Example continued

plot(view_counts$idx, log10(view_counts$views), type = "b",
     xlab = "Lecture number", ylab = "View counts (log 10)")
abline(model_2)

iNterpreting data

The purpose is to gain insights from numbers

Common tasks

What have we learned?
What should we do next?
Disseminate results and communicate with others
Produce useful products

Tools and skills

Domain expertise and intuition
Being skeptical (double check)
Communication skills (presentation, writing)
Reproducible reports (with Rmarkdown and other tools)

Example continued

??

Doing data science is an iterative and non-linear process!
