In regression models (and sometimes in plots), we need formulas to express relationships between variables. R provides a formula class for this purpose.
sample_f = as.formula(y ~ x1 + x2 + x3)
sample_f
## y ~ x1 + x2 + x3
class(sample_f)
## [1] "formula"
typeof(sample_f)
## [1] "language"
Here is an explanation of the meaning of the different items in formulas:

~
: separates the response variable (on the left) from the explanatory variables (on the right)

+
: expresses a linear combination of variables, e.g., x1 + x2 + x3

0 or -1
: when added to a formula, indicates that no intercept term should be included

|
: specifies conditioning or grouping variables, e.g., (1 | species)

I()
: the identity function; indicates that the enclosed expression should be interpreted by its arithmetic meaning, e.g., I(a + b) means a single variable "a plus b", not "a and b"

*
: indicates interactions between variables: y ~ x1 * x2 is equivalent to y ~ x1 + x2 + x1:x2

^
: indicates crossing to a specified degree: y ~ (x1 + x2)^2 is equivalent to y ~ (x1 + x2) * (x1 + x2)

Functions of variables can also be used in formulas, e.g., y ~ log(x), y ~ poly(x), and y ~ s(x) for gam().
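To see how these operators expand, you can inspect the terms of a formula; for example, * expands into the main effects plus the interaction:

attr(terms(y ~ x1 * x2), "term.labels")
## [1] "x1"    "x2"    "x1:x2"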
In most cases, we have more than one predictor variable to include in the regression model; such models are called multiple regression models. When the predictors include both numeric and categorical variables, the model is an ANCOVA (analysis of covariance).
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \varepsilon\]
\(\beta_0\): intercept, expected value of Y when \(X_1=X_2=...=X_p=0\)
\(\beta_j\): slope (partial regression coefficient), the expected change in Y for a one-unit increase in \(X_j\) with all other \(X_{j'}\) (\(j'\neq j\)) held constant, where \(j=1,2,...,p\).
Same assumptions as simple linear regression (SLR).
Parameter estimation, confidence interval calculation, \(R^2\), residual types, etc., are largely similar to (but more complicated than) those for SLR.
Order of terms in MLR model fitting: lm(y ~ x1 + x2) and lm(y ~ x2 + x1) give the same output when using summary(), but not when using anova(), because the tested hypotheses are different. For details, look up the different types of sums of squares (SS). By default, R calculates Type I (sequential) SS, as the sketch below illustrates.
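A minimal sketch with simulated (hypothetical) data: both orderings give identical summary() coefficients, but the sequential SS reported by anova() change with the order (output omitted here):

set.seed(1)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 1 + 2 * d$x1 + 3 * d$x2 + rnorm(30)
anova(lm(y ~ x1 + x2, data = d))  # SS for x1 is computed first
anova(lm(y ~ x2 + x1, data = d))  # SS for x2 is computed first; x1's SS differs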
Rye = read.csv("https://git.psu.edu/stat/essential-r/-/raw/master/Data/Rye-ANCOVA-2008.csv")
head(Rye)
## Plant Term Rep RyeDMg Pdate Tdate GrDay Tdate2
## 1 P1 T1 I 443.0 257 20.0 233.0 20.0
## 2 P2 T1 I 287.0 288 20.0 202.0 20.0
## 3 P1 T2 I 631.5 257 32.5 245.5 32.5
## 4 P2 T2 I 513.0 288 32.5 214.5 32.5
## 5 P1 T3 I 771.5 257 39.5 252.5 39.5
## 6 P2 T3 I 570.5 288 39.5 221.5 39.5
summary(lm(RyeDMg ~ Tdate, data = Rye))
##
## Call:
## lm(formula = RyeDMg ~ Tdate, data = Rye)
##
## Residuals:
## Min 1Q Median 3Q Max
## -195.779 -87.402 2.592 82.281 164.221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.232 86.083 1.118 0.276
## Tdate 16.963 2.715 6.249 2.74e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 107.3 on 22 degrees of freedom
## Multiple R-squared: 0.6396, Adjusted R-squared: 0.6232
## F-statistic: 39.04 on 1 and 22 DF, p-value: 2.739e-06
summary(lm(RyeDMg ~ Plant, data = Rye))
##
## Call:
## lm(formula = RyeDMg ~ Plant, data = Rye)
##
## Residuals:
## Min 1Q Median 3Q Max
## -256.63 -115.16 24.06 83.00 332.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 689.25 46.68 14.766 6.73e-13 ***
## PlantP2 -145.63 66.01 -2.206 0.0381 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 161.7 on 22 degrees of freedom
## Multiple R-squared: 0.1811, Adjusted R-squared: 0.1439
## F-statistic: 4.867 on 1 and 22 DF, p-value: 0.03812
summary(lm(RyeDMg ~ Tdate + Plant, data = Rye))
##
## Call:
## lm(formula = RyeDMg ~ Tdate + Plant, data = Rye)
##
## Residuals:
## Min 1Q Median 3Q Max
## -122.967 -65.880 -5.537 58.567 183.033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 169.04 64.12 2.636 0.015429 *
## Tdate 16.96 1.96 8.656 2.27e-08 ***
## PlantP2 -145.62 31.61 -4.607 0.000152 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77.43 on 21 degrees of freedom
## Multiple R-squared: 0.8208, Adjusted R-squared: 0.8037
## F-statistic: 48.08 on 2 and 21 DF, p-value: 1.45e-08
model1 <- lm(RyeDMg ~ Tdate + Plant + Tdate:Plant, data = Rye)
summary(model1)
##
## Call:
## lm(formula = RyeDMg ~ Tdate + Plant + Tdate:Plant, data = Rye)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.928 -59.276 -5.537 59.179 169.072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 217.514 88.641 2.454 0.0234 *
## Tdate 15.383 2.795 5.503 2.19e-05 ***
## PlantP2 -242.565 125.358 -1.935 0.0673 .
## Tdate:PlantP2 3.161 3.953 0.800 0.4333
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78.1 on 20 degrees of freedom
## Multiple R-squared: 0.8263, Adjusted R-squared: 0.8003
## F-statistic: 31.71 on 3 and 20 DF, p-value: 8.487e-08
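Because the interaction term is not significant here (p = 0.43), a nested-model comparison can help decide whether it is needed; a sketch using a partial F-test (output omitted):

anova(lm(RyeDMg ~ Tdate + Plant, data = Rye), model1)  # additive vs. interaction model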
plot(RyeDMg ~ Tdate, data = Rye, col = as.factor(Rye$Plant))
mcf <- summary(model1)$coef[, 1] #coefficient estimates only
abline(a = mcf[1], b = mcf[2])
abline(a = mcf[1] + mcf[3], b = mcf[2] + mcf[4], lty = 2, col = "red")
library(tidyverse, warn.conflicts = FALSE)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
ggplot(Rye, aes(x = Tdate, y = RyeDMg, color = Plant)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Multicollinearity arises when there is a strong linear relationship among the predictor variables (\(X_1, X_2, ..., X_p\)).
The main effect of multicollinearity on the inference of the regression coefficients is to inflate their variances (and hence their standard errors), making the estimates unstable. This inflation is quantified by the variance inflation factor (VIF):
For \(j = 1, ..., p\), a variance inflation factor (VIF) for \(\hat{\beta_j}\) is defined as
\[VIF_j=\frac{1}{1-R_j^2}\]
where \(R_j^2\) is the coefficient of multiple determination when \(X_j\) is regressed on the other \(p-1\) predictor variables \(X_{j'}\).
If the largest of the \(VIF_j\) values (j = 1, …, p) exceeds 10, multicollinearity may have a large impact on the inference. If the mean of the \(VIF_j\) values is considerably greater than 1, there may be serious multicollinearity problems.
In R, you can compute VIFs for a fitted model with car::vif().
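A sketch for the additive Rye model fitted above (assumes the car package is installed; output omitted):

fit <- lm(RyeDMg ~ Tdate + Plant, data = Rye)
car::vif(fit)  # values near 1 indicate little collinearity; > 10 is worrying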
Remedial measures for multicollinearity include dropping or combining correlated predictors, which brings us to model selection.
There are many different ways to test different combinations of variables in models (forward selection, backward selection, both directions, …), but all of these methods have been criticized (biased p-values, etc.). In R, you can use the step() function to conduct stepwise model selection, as sketched after the function signature below.
args(step)
## function (object, scope, scale = 0, direction = c("both", "backward",
## "forward"), trace = 1, keep = NULL, steps = 1000, k = 2,
## ...)
## NULL
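For example, a backward selection by AIC starting from the interaction model fitted above (a sketch; trace = 0 suppresses the step-by-step log):

step(model1, direction = "backward", trace = 0)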
The most defensible practice is to build a model based on hypotheses.
General principles: use simple models; let science and common sense guide the approach. Sometimes there is no single "best" model, and there may not be enough information in the data to tell exactly what the truth is.
Generalized linear models (GLMs) are a generalization of linear models and can be used to fit linear regression models, logistic regression models, Poisson regression models, etc. In R, you can fit all of these model types with the glm() function.
args(glm)
## function (formula, family = gaussian, data, weights, subset,
## na.action, start = NULL, etastart, mustart, offset, control = list(...),
## model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, singular.ok = TRUE,
## contrasts = NULL, ...)
## NULL
The following families (shown with their default link functions) are available in R:
binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
quasi(link = "identity", variance = "constant")
quasibinomial(link = "logit")
quasipoisson(link = "log")
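As a minimal sketch, a logistic regression on simulated (hypothetical) data using the binomial family (output omitted):

set.seed(42)
d2 <- data.frame(x = rnorm(100))
d2$y <- rbinom(100, 1, plogis(-0.5 + 1.2 * d2$x))  # true model on the logit scale
fit2 <- glm(y ~ x, family = binomial(link = "logit"), data = d2)
coef(summary(fit2))  # estimates, SEs, z values, p-values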
Further reading: A brief introduction to mixed effects modelling and multi-model inference in ecology.