Exploratory Data Analysis

Daijiang Li

11/02/2023

What is exploratory data analysis (EDA)?

Looking at data to see what it seems to say
-- John Tukey

Exploratory data analysis is about creative investigation to generate new ideas, as opposed to “confirmatory” data analysis, which sets out to answer specific, predefined questions.

Uses of EDA:

Questions you should ask when conducting EDA:

Some details

Useful functions to get an overview of datasets:

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
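
A few other base R functions complement str() and summary(). The sketch below is one possible selection, not a list from the original notes: it checks dimensions, peeks at the first rows, counts missing values per column, and tabulates the one categorical variable.

```r
# Dimensions: number of rows and columns
dim(iris)

# First few rows, to see what the raw values look like
head(iris)

# Number of missing values in each column
n_missing <- colSums(is.na(iris))
n_missing

# Counts per level of the categorical variable
species_counts <- table(iris$Species)
species_counts
```

For iris these checks are uneventful (no missing values, balanced groups), which is itself worth knowing before moving on.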

Also, make lots of plots, because summaries won’t tell you everything! Use scatter plots, boxplots, histograms, etc. Pay attention to scales: you may want to take logs when values differ by orders of magnitude.
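
As a minimal sketch of this advice, the base graphics code below draws a boxplot and a scatter plot for iris, then shows the effect of a log scale. The built-in islands dataset (landmass areas) is an assumption chosen here only because its values span several orders of magnitude; it is not used in the original notes.

```r
op <- par(mfrow = c(2, 2))  # 2 x 2 grid of panels; keep the old settings

# How one variable differs among groups
boxplot(Petal.Length ~ Species, data = iris, main = "Petal length by species")

# Relationship between two variables, colored by species
plot(iris$Petal.Length, iris$Petal.Width,
     col = as.integer(iris$Species),
     xlab = "Petal length", ylab = "Petal width",
     main = "Petal width vs. length")

# Largest landmass relative to the smallest: a quick check on the spread
size_ratio <- max(islands) / min(islands)

# On the raw scale a few huge values dominate; on the log scale the
# bulk of the data becomes visible
hist(islands, main = "Raw scale", xlab = "Area")
hist(log10(islands), main = "Log scale", xlab = "log10(Area)")

par(op)  # restore the previous graphics settings
```

Comparing the last two panels side by side is usually enough to decide whether a log transformation is worth using for later plots.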

Anscombe’s quartet

Anscombe’s quartet: all four datasets have nearly identical simple summary statistics (means, variances, correlation), but look very different when graphed
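
The quartet ships with R as the anscombe data frame (columns x1 to x4 and y1 to y4), so the claim is easy to check. The sketch below computes the same summary statistics for each set and then plots all four; the plotting details (axis limits, fitted line) are choices made for this illustration.

```r
# Summary statistics for each of the four x/y pairs
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x  = var(x),  var_y  = var(y),
    cor_xy = cor(x, y))
})
colnames(quartet_stats) <- paste("set", 1:4)
round(quartet_stats, 2)  # the columns are almost indistinguishable

# The plots tell a very different story
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Set", i), xlim = c(3, 19), ylim = c(3, 13))
  abline(lm(y ~ x), col = "red")  # nearly the same fitted line in every panel
}
par(op)
```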

Try to ask these questions at the beginning of EDA: What might have gone wrong? How can we reveal such problems?
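
One way to start answering these questions is a handful of quick sanity checks. The sketch below uses the built-in airquality dataset, an assumption chosen here only because it contains missing values; the specific checks (missing values, duplicated rows, out-of-range values, outliers) are illustrative rather than a complete list.

```r
# Missing values per column
n_missing <- colSums(is.na(airquality))
n_missing

# Fully duplicated rows (possible data-entry or merging errors)
n_duplicated <- sum(duplicated(airquality))
n_duplicated

# Values outside the range they should logically fall in
bad_month <- !(airquality$Month %in% 1:12)
bad_day   <- !(airquality$Day %in% 1:31)
n_bad_dates <- sum(bad_month | bad_day)
n_bad_dates

# Points that a boxplot would flag as outliers (NAs are ignored)
ozone_outliers <- boxplot.stats(airquality$Ozone)$out
ozone_outliers
```

A flagged value is not necessarily an error; the point is to know which observations deserve a closer look.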

Pay attention to:

Don’t trust anyone, including yourself. Is a result too good to be true? Could it be an artifact?

Even more importantly, don’t stop there: follow up and try to figure out the reasons behind artifacts, because they might turn out to be the most interesting and unexpected results.

Important principles:

References: