Formulas

In regression models (and sometimes in plots), we need formulas to express relationships between variables. R provides a formula class for this purpose.

sample_f = as.formula(y ~ x1 + x2 + x3)
sample_f
## y ~ x1 + x2 + x3
class(sample_f)
## [1] "formula"
typeof(sample_f)
## [1] "language"

Different items inside a formula carry special meanings; the most common ones are illustrated below.
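A minimal sketch of the most common formula operators (all standard R formula syntax; the variable names are placeholders):

y ~ x1 + x2          # include x1 and x2 as predictors
y ~ x1:x2            # interaction between x1 and x2 only
y ~ x1 * x2          # main effects plus interaction: x1 + x2 + x1:x2
y ~ . - x3           # all columns in the data except y and x3
y ~ x1 + I(x1^2)     # I() protects arithmetic, so ^2 is taken literally
y ~ x1 - 1           # drop the intercept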

Multiple linear regression (MLR)

In most cases, more than one predictor variable needs to be included in the regression model. We call such models multiple regression models. When the predictors include both numeric and categorical variables, the model is an ANCOVA (analysis of covariance).

\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \varepsilon\]

\(\beta_0\): intercept, expected value of Y when \(X_1=X_2=...=X_p=0\)

\(\beta_j\): slope (partial regression coefficient), expected change in Y for one-unit increase in \(X_j\) with all other \(X_{j'}\ (j'\neq j)\) held constant, where \(j=1,2,...,p\).

Same assumptions as simple linear regression (SLR).

Parameter estimation, confidence interval calculation, \(R^2\), residual types, etc., are largely similar to SLR (but more complicated).

Order of MLR model fitting: lm(y ~ x1 + x2) and lm(y ~ x2 + x1) give the same output with summary(), but not with anova(), because the hypotheses tested are different. For details, look up the different types of sums of squares (SS); by default, R calculates Type I (sequential) SS. A small illustration follows.
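A minimal sketch of this order dependence, using the built-in mtcars data (wt and hp are correlated, so the Type I SS depend on the fitting order; output not shown):

anova(lm(mpg ~ wt + hp, data = mtcars))         # wt entered first, hp given wt
anova(lm(mpg ~ hp + wt, data = mtcars))         # hp entered first, wt given hp
summary(lm(mpg ~ wt + hp, data = mtcars))$coef  # coefficients identical either way
summary(lm(mpg ~ hp + wt, data = mtcars))$coef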

ANCOVA

Rye = read.csv("https://git.psu.edu/stat/essential-r/-/raw/master/Data/Rye-ANCOVA-2008.csv")
head(Rye)
##   Plant Term Rep RyeDMg Pdate Tdate GrDay Tdate2
## 1    P1   T1   I  443.0   257  20.0 233.0   20.0
## 2    P2   T1   I  287.0   288  20.0 202.0   20.0
## 3    P1   T2   I  631.5   257  32.5 245.5   32.5
## 4    P2   T2   I  513.0   288  32.5 214.5   32.5
## 5    P1   T3   I  771.5   257  39.5 252.5   39.5
## 6    P2   T3   I  570.5   288  39.5 221.5   39.5
summary(lm(RyeDMg ~ Tdate, data = Rye))
## 
## Call:
## lm(formula = RyeDMg ~ Tdate, data = Rye)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -195.779  -87.402    2.592   82.281  164.221 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   96.232     86.083   1.118    0.276    
## Tdate         16.963      2.715   6.249 2.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 107.3 on 22 degrees of freedom
## Multiple R-squared:  0.6396, Adjusted R-squared:  0.6232 
## F-statistic: 39.04 on 1 and 22 DF,  p-value: 2.739e-06
summary(lm(RyeDMg ~ Plant, data = Rye))
## 
## Call:
## lm(formula = RyeDMg ~ Plant, data = Rye)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -256.63 -115.16   24.06   83.00  332.88 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   689.25      46.68  14.766 6.73e-13 ***
## PlantP2      -145.63      66.01  -2.206   0.0381 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 161.7 on 22 degrees of freedom
## Multiple R-squared:  0.1811, Adjusted R-squared:  0.1439 
## F-statistic: 4.867 on 1 and 22 DF,  p-value: 0.03812
summary(lm(RyeDMg ~ Tdate + Plant, data = Rye))
## 
## Call:
## lm(formula = RyeDMg ~ Tdate + Plant, data = Rye)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -122.967  -65.880   -5.537   58.567  183.033 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   169.04      64.12   2.636 0.015429 *  
## Tdate          16.96       1.96   8.656 2.27e-08 ***
## PlantP2      -145.62      31.61  -4.607 0.000152 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.43 on 21 degrees of freedom
## Multiple R-squared:  0.8208, Adjusted R-squared:  0.8037 
## F-statistic: 48.08 on 2 and 21 DF,  p-value: 1.45e-08
model1 <- lm(RyeDMg ~ Tdate + Plant + Tdate:Plant, data = Rye)
summary(model1)
## 
## Call:
## lm(formula = RyeDMg ~ Tdate + Plant + Tdate:Plant, data = Rye)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -136.928  -59.276   -5.537   59.179  169.072 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    217.514     88.641   2.454   0.0234 *  
## Tdate           15.383      2.795   5.503 2.19e-05 ***
## PlantP2       -242.565    125.358  -1.935   0.0673 .  
## Tdate:PlantP2    3.161      3.953   0.800   0.4333    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.1 on 20 degrees of freedom
## Multiple R-squared:  0.8263, Adjusted R-squared:  0.8003 
## F-statistic: 31.71 on 3 and 20 DF,  p-value: 8.487e-08
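Whether the interaction term is needed can also be checked with a nested-model comparison; a minimal sketch (model0 is just a name introduced here for the additive model fit above; output not shown):

model0 <- lm(RyeDMg ~ Tdate + Plant, data = Rye)  # additive model
anova(model0, model1)                             # partial F-test for Tdate:Plant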
plot(RyeDMg ~ Tdate, data = Rye, col = as.factor(Rye$Plant))  # points colored by Plant
mcf <- summary(model1)$coef[, 1] #coefficient estimates only
abline(a = mcf[1], b = mcf[2])   # fitted line for P1 (the reference level)
abline(a = mcf[1] + mcf[3], b = mcf[2] + mcf[4], lty = 2, col = "red")  # fitted line for P2
library(tidyverse, warn.conflicts = FALSE)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.1 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

ggplot(Rye, aes(x = Tdate, y = RyeDMg, color = Plant)) +
  geom_point() +
  geom_smooth(method = "lm")  # separate least-squares line for each Plant level
## `geom_smooth()` using formula 'y ~ x'

Multicollinearity

Multicollinearity arises when there is a strong linear relationship among the predictor variables (\(X_1, X_2, ..., X_p\)).

Effects of multicollinearity on inference about the regression coefficients (a simulated sketch follows the list):

  • Larger changes in \(\hat{\beta_j}\) when another predictor variable is added or deleted.
  • Larger changes in \(\hat{\beta_j}\) when another observation is added or deleted.
  • Larger s.e.(\(\hat{\beta_j}\)).
  • Harder to interpret \(\hat{\beta_j}\) as the effect of \(X_j\) on \(Y\), because the other \(X_{j'}\)’s cannot be held constant.
  • Estimation/prediction of Y is not a problem as long as it stays within the data range.
  • The sign of \(\hat{\beta_j}\) may be counter-intuitive.
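A minimal simulated sketch of the standard-error inflation (the variable names are invented here; output not shown):

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)   # x2 is nearly a copy of x1
y  <- 2 + 3 * x1 + rnorm(100)
summary(lm(y ~ x1))$coef           # small s.e. for the x1 coefficient
summary(lm(y ~ x1 + x2))$coef      # much larger s.e. once x2 is added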

Variance Inflation Factor (VIF):

For \(j = 1, ..., p\), a variance inflation factor (VIF) for \(\hat{\beta_j}\) is defined as

\[VIF_j=\frac{1}{1-R_j^2}\]

where \(R_j^2\) is the coefficient of multiple determination when \(X_j\) is regressed on the other \(p - 1\) predictor variables \(X_{j'}\).

If the largest of the \(VIF_j\) values (j = 1, …, p) is larger than 10, multicollinearity may have a large impact on the inference. If the mean of the \(VIF_j\)’s is considerably greater than 1, there may be serious multicollinearity problems.

In R, VIFs can be computed with car::vif().
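A minimal sketch (assuming the car package is installed; the built-in mtcars data are used only for illustration):

library(car)
vif(lm(mpg ~ wt + hp + disp, data = mtcars))  # wt, hp, disp are correlated, so VIFs are well above 1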

Remedial Measures for Multicollinearity:

  • Multicollinearity is not a modeling error, but rather a condition of the predictor variables.
  • One remedial measure is to add data from outside the current data range.
  • In practice, however, it may be hard to obtain observations outside the data range.
  • Alternatively, one can consider more advanced statistical methods (e.g. principal component regression, ridge regression).
  • Drop one or more predictor variables from the model. [Which one(s)?]
  • Create new predictor variables (e.g. \(X' = X_2 + X_3\)).

Model selection

There are many different ways to test different combinations of variables in a model (forward selection, backward selection, both directions, …), but all of these methods have been criticized (biased p-values, etc.). In R, you can use the step() function to conduct model selection.

args(step)
## function (object, scope, scale = 0, direction = c("both", "backward", 
##     "forward"), trace = 1, keep = NULL, steps = 1000, k = 2, 
##     ...) 
## NULL
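A minimal sketch of backward selection starting from the interaction model fit above (AIC-based by default; output not shown):

step(model1, direction = "backward")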

The most defensible practice is to build a model based on a hypothesis.

General principles: use simple models, and let science and common sense guide the approach. Sometimes there is no single “best” model, and there may not be enough information in the data to tell exactly what the truth is.

Nonlinear models

Generalized linear models (GLMs)

GLMs are a generalization of linear models and can be used to fit linear regression models, logistic regression models, Poisson regression models, etc.

In R, all of these model types can be fit with the glm() function.

args(glm)
## function (formula, family = gaussian, data, weights, subset, 
##     na.action, start = NULL, etastart, mustart, offset, control = list(...), 
##     model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, ...) 
## NULL

The following family functions are available in R:

binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
quasi(link = "identity", variance = "constant")
quasibinomial(link = "logit")
quasipoisson(link = "log")
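A minimal sketch of a logistic regression fit with glm() (the built-in mtcars data are used only for illustration; am is a 0/1 variable, and fit_logit is just a name used here):

fit_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))
summary(fit_logit)  # coefficients are on the log-odds scale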