ggplot() + aes() + geom_*()
or
ggplot(data, aes()) + geom_*()
R includes three major packages (systems) for plotting data:
graphics, lattice, and ggplot2. The graphics package comes with R and
is referred to as base R plotting. It is easy to customize or modify
charts with the graphics package, or to interact with them on the
screen (interactive plots). The lattice package contains an
alternative set of functions for plotting data. Lattice graphics are
well suited for splitting data by a conditioning variable, and they
used to be quite popular. The third package (or system) is ggplot2,
where gg stands for grammar of graphics. ggplot2 is probably the most
popular package for plotting nowadays; hundreds of extension packages
build on it to extend its capabilities, making it a well-developed
system for plotting. We will thus focus mostly on ggplot2.
Here, I will only introduce several commonly used plotting functions
from the base R graphics package. They are useful for quickly plotting
something during data exploration. I personally use them frequently
during data analysis. However, I now produce almost all figures in
publications with ggplot2.
To show a scatter plot, use the plot function.
?plot.default
plot(x, y = NULL, type = "p", xlim = NULL, ylim = NULL, main = NULL, sub = NULL,
     xlab = NULL, ylab = NULL, axes = TRUE, ...)
plot is a generic function: it can draw many different types of
objects, including vectors, tables, and time series. If you need to
plot something generated by another package, chances are that package
provides a plot method for it. For example, if we have a phylogeny,
plot can draw it.
## Loading required package: ape
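To make the idea of a generic function concrete, here is a minimal sketch that uses only datasets shipped with R (iris and AirPassengers are my choice for illustration, not necessarily what the lecture used). The same call, plot, produces a different kind of figure depending on the class of its first argument.

```r
# Scatter plot of two numeric vectors
plot(iris$Sepal.Length, iris$Sepal.Width,
     xlab = "Sepal length", ylab = "Sepal width",
     main = "Scatter plot")

# A table object: plot() dispatches to the method for tables
species_counts <- table(iris$Species)
plot(species_counts, ylab = "Number of flowers")

# A time series object: plot() dispatches to the method for "ts"
plot(AirPassengers)

# The class of the object is what drives the dispatch
class(species_counts)
class(AirPassengers)
```

Packages such as ape follow the same pattern by supplying a plot method for their own classes.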
When performing data analysis, it’s often very important to understand the shape of a data distribution. Looking at a distribution can tell you whether there are outliers in the data, or whether a certain modeling technique will work on your data, or simply how many observations are within a certain range of values.
The best known technique for visualizing a distribution is the histogram.
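As a small illustration, the sketch below draws a histogram with the base R hist function. The choice of the built-in faithful dataset and of roughly 20 bins is mine, for illustration only.

```r
# Eruption durations (in minutes) from the built-in faithful dataset
x <- faithful$eruptions

# breaks = 20 is only a suggestion; R picks "pretty" break points near it
h <- hist(x, breaks = 20,
          main = "Old Faithful eruptions",
          xlab = "Eruption duration (min)")

# hist() invisibly returns the bin edges and the count in each bin
h$breaks
h$counts
```

Every observation falls into exactly one bin, so the counts add up to the number of observations.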
Another very useful way to visualize a distribution is a box plot. A box plot is a compact way to show the distribution of a variable. The box shows the interquartile range. The interquartile range contains values between the 25th and 75th percentile; the line inside the box shows the median. The two “whiskers” on either side of the box show the adjacent values. The adjacent values are intended to show extreme values, but they don’t always extend to the absolute maximum or minimum value. The upper adjacent value is the value of the largest observation that is less than or equal to the upper quartile plus 1.5 times the length of the interquartile range; the lower adjacent value is the value of the smallest observation that is greater than or equal to the lower quartile less 1.5 times the length of the interquartile range. When there are values far outside the range we would expect for normally distributed data, those outlying values are plotted separately.
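The whisker rule described above can be checked by hand. The sketch below (using iris$Sepal.Width, my choice of example data) computes the adjacent values and the outliers manually and then draws the box plot. One detail worth knowing: base R's boxplot uses the hinges returned by fivenum, which can differ slightly from the quartiles returned by quantile.

```r
x <- iris$Sepal.Width

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
hinges <- fivenum(x)
iqr <- hinges[4] - hinges[2]           # length of the box

# Limits at 1.5 box lengths beyond each hinge
upper_fence <- hinges[4] + 1.5 * iqr
lower_fence <- hinges[2] - 1.5 * iqr

# Adjacent values: the most extreme observations still inside the limits
upper_adjacent <- max(x[x <= upper_fence])
lower_adjacent <- min(x[x >= lower_fence])

# Observations beyond the limits are drawn as separate points
outliers <- x[x < lower_fence | x > upper_fence]

# boxplot() returns the same numbers it uses for drawing
b <- boxplot(x, ylab = "Sepal width (cm)", main = "Box plot of sepal width")
b$stats   # lower adjacent, lower hinge, median, upper hinge, upper adjacent
b$out     # points plotted separately
```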
Graphics in R are plotted on a graphics device. You can generate
graphics in common formats using the bmp, jpeg, png, and tiff devices.
Other devices include postscript, pdf, pictex (to generate
LaTeX/PicTeX), xfig, and bitmap.
We can save multiple figures in one file, which can be useful when we explore the data.
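A minimal sketch of the open, draw, close pattern with the pdf device follows. The file name and the three plots are placeholders of my choosing. Each high-level plotting call starts a new page, so one PDF ends up holding several figures.

```r
# Write to a temporary folder so the example leaves nothing behind
pdf_file <- file.path(tempdir(), "exploration.pdf")

pdf(pdf_file, width = 7, height = 5)   # open the device (size in inches)
hist(faithful$eruptions,
     main = "Eruption duration", xlab = "Minutes")        # page 1
boxplot(Sepal.Width ~ Species, data = iris,
        ylab = "Sepal width")                             # page 2
plot(faithful$eruptions, faithful$waiting,
     xlab = "Eruption (min)", ylab = "Waiting (min)")     # page 3
dev.off()                              # close the device to finish the file

# Bitmap devices such as png() follow the same pattern:
# png("figure.png"); plot(...); dev.off()
file.exists(pdf_file)
```

Forgetting dev.off() is the most common reason for an empty or unreadable output file.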
There is NO WAY to cover ggplot2 in one lecture! I list some
resources here so that you can learn it later.
ggplot2 is a package developed by Hadley Wickham (yes, he also
developed the whole tidyverse with others) based on the idea of the
grammar of graphics, a concept created by Leland Wilkinson (Wilkinson
2005, a book). ggplot2 graphics are built up from modular logical
pieces; we can add layers with +.
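To show what building a figure from modular pieces looks like, here is a minimal sketch that adds one piece at a time to a plot of the iris data. The intermediate object name p is just for illustration; in practice the pieces are usually chained in a single expression.

```r
library(ggplot2)

p <- ggplot(data = iris)                              # data only, nothing drawn yet
p <- p + aes(x = Sepal.Length, y = Sepal.Width)       # map variables to axes
p <- p + geom_point()                                 # layer 1: points
p <- p + geom_smooth(method = "lm", formula = y ~ x)  # layer 2: fitted line
p                                                     # printing draws the plot
```

Passing formula = y ~ x explicitly avoids the informational message that geom_smooth otherwise prints.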
ggplot2 supports a continuum of expertise. One can get started right
away and can also build complex, publication-quality figures with some
extra effort and practice.
From the reading chapter: The components of ggplot2’s grammar of graphics are
ggplot2 works with data frames only, not vectors.
Some terminology
ggplot - The main function where you specify the dataset and variables to plot
geoms - geometric objects
  geom_point(), geom_bar(), geom_density(), geom_line(), geom_area(), etc.
aes - aesthetics
scales - Define how your data will be plotted
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# can also set aes() within ggplot()
# this is ** Global setting **
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
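The difference between a global and a local aesthetic mapping is easiest to see with a second layer. In this sketch (the object names p_global and p_local are mine), color is mapped to Species either in ggplot(), where every layer inherits it, or only inside geom_point().

```r
library(ggplot2)

# Global mapping: color is set in ggplot(), so geom_smooth() inherits it
# and fits one line per species
p_global <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Local mapping: color applies to the points only, so geom_smooth()
# fits a single line through all the data
p_local <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = "lm", formula = y ~ x)

p_global
p_local
```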
# ggplot2 must work with a data frame, not a vector!
ggplot(as.data.frame(d2), aes(x = d2)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(as.data.frame(d2), aes(x = d2)) +
  geom_histogram(binwidth = 1, color = "lightblue", fill = "blue")
## # A tibble: 3 × 2
## Species l
## <fct> <dbl>
## 1 setosa 250.
## 2 versicolor 297.
## 3 virginica 329.
## # A tibble: 600 × 3
## Species name value
## <fct> <chr> <dbl>
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Width 3.5
## 3 setosa Petal.Length 1.4
## 4 setosa Petal.Width 0.2
## 5 setosa Sepal.Length 4.9
## 6 setosa Sepal.Width 3
## 7 setosa Petal.Length 1.4
## 8 setosa Petal.Width 0.2
## 9 setosa Sepal.Length 4.7
## 10 setosa Sepal.Width 3.2
## # ℹ 590 more rows
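The code that produced the two tables above is not shown. As a sketch only, the following dplyr and tidyr calls give output of the same shape: a per-species sum of Sepal.Length, and a long table with one row per measurement. That this is how l and d3 were actually created is my assumption, based on the column names and values in the printed output.

```r
library(dplyr)
library(tidyr)

# Assumed: l is the sum of Sepal.Length within each species
sums <- iris %>%
  group_by(Species) %>%
  summarise(l = sum(Sepal.Length))
sums

# Assumed: d3 is iris reshaped to long format, keeping Species as an id column.
# pivot_longer() names the new columns "name" and "value" by default.
d3 <- iris %>%
  pivot_longer(cols = -Species)
d3
```

The long format is what makes it possible to map the measurement name to an aesthetic such as fill.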
Use the d3 dataset to generate the plot below:
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Use color-blind friendly, print friendly colors!
Some useful links:
ggplot(d3, aes(x = Species, y = value, fill = name)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Set1")
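For color-blind friendly palettes, two options that need no extra packages are the viridis scales built into ggplot2 and the Okabe-Ito palette that ships with R 4.0 or later. The sketch below applies both to a scatter plot of iris; the object names are mine.

```r
library(ggplot2)

base_plot <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()

# Option 1: viridis, designed to stay readable in grayscale
# and under common forms of color blindness
p_viridis <- base_plot + scale_color_viridis_d()

# Option 2: Okabe-Ito colors from base R (requires R >= 4.0).
# unname() matters: otherwise ggplot2 tries to match the color names
# ("orange", "skyblue", ...) to the species names.
okabe_ito <- unname(palette.colors(4, palette = "Okabe-Ito"))[2:4]  # skip black
p_okabe <- base_plot + scale_color_manual(values = okabe_ito)

p_viridis
p_okabe
```

For filled geoms such as bars, the matching functions are scale_fill_viridis_d() and scale_fill_manual().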
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_grid(Species ~ .)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_grid(Species ~ .) +
  theme(legend.position = "none") # remove legend
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ Species) +
  theme(legend.position = "none") # remove legend
## `geom_smooth()` using formula = 'y ~ x'
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ Species) +
  labs(x = "Sepal length",
       y = "Sepal width",
       color = "Species names") +
  theme(legend.position = "bottom",
        legend.key = element_rect(fill = NA),
        strip.background = element_rect(fill = NA)
  )
## `geom_smooth()` using formula = 'y ~ x'
p = ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()
p + theme_bw() # comes with ggplot2
There is a ggthemes package for you to use.