In regression models (and plots sometimes), we need to use formulas
to express relationships between variables. R provides a
*formula* class for this purpose.

```
sample_f = as.formula(y ~ x1 + x2 + x3)
sample_f
```

`## y ~ x1 + x2 + x3`

`class(sample_f)`

`## [1] "formula"`

`typeof(sample_f)`

`## [1] "language"`

Here is an explanation of the meaning of different items in formulas:

- Variable names: variables that hold the data.
- `~`: separates the response variable (on the left) from the explanatory variables (on the right).
- `+`: expresses a linear combination of variables, e.g., `x1 + x2 + x3`.
- `0` or `-1`: when added to a formula, indicates that no intercept term should be included.
- `|`: specifies conditioning or grouping variables, e.g., `(1 | species)`.
- `I()`: the identity function, indicating that the enclosed expression should be interpreted by its arithmetic meaning; e.g., `I(a + b)` means a single variable "a plus b", not `a + b`, "a and b".
- `*`: indicates interactions between variables: `y ~ x1 * x2` is equivalent to `y ~ x1 + x2 + x1:x2`.
- `^`: indicates crossing to a specified degree: `y ~ (x1 + x2)^2` is equivalent to `y ~ (x1 + x2) * (x1 + x2)`.
- Functions of variables: a function of the specified variables is treated as a variable, e.g., `y ~ log(x)`, `y ~ poly(x)`, or `y ~ s(x)` for `gam`.
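As a quick check of these rules, `terms()` reports how R expands a formula's operators into model terms (a sketch with placeholder variables `y`, `x1`, `x2`):

```r
# How R expands formula operators into individual model terms:
attr(terms(y ~ x1 * x2), "term.labels")       # "x1" "x2" "x1:x2"
attr(terms(y ~ (x1 + x2)^2), "term.labels")   # same expansion as x1 * x2
attr(terms(y ~ x1 + x2 + 0), "intercept")     # 0 means no intercept term
attr(terms(y ~ I(x1 + x2)), "term.labels")    # a single predictor: "I(x1 + x2)"
```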

In most cases, more than one predictor variable needs to be included in the regression model; such models are called multiple regression (MLR). When the predictors include both numeric and categorical variables, the model is an ANCOVA (analysis of covariance).

\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \varepsilon\]

\(\beta_0\): intercept, expected value of Y when \(X_1=X_2=...=X_p=0\)

\(\beta_j\): slope (partial regression coefficient), expected change in Y for one-unit increase in \(X_j\) with all other \(X_{j'}\ (j'\neq j)\) held constant, where \(j=1,2,...,p\).

Same assumptions as simple linear regression (SLR).

Estimation of parameters, confidence interval calculations, \(R^2\), residual types, etc., are largely similar to (but more complicated than) those for SLR.

**Order of MLR model fitting**: `lm(y ~ x1 + x2)` and `lm(y ~ x2 + x1)` give the same output with `summary()`, but not with `anova()`, because the tested hypotheses are different. For details, look up the different types of sums of squares (SS). By default, R calculates Type I SS (sequential).
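A quick illustration with the built-in `mtcars` data (the predictors `wt` and `hp` are chosen arbitrarily): the fitted coefficients do not depend on predictor order, but the sequential (Type I) SS in `anova()` do.

```r
# Same model, two predictor orders.
fit_a <- lm(mpg ~ wt + hp, data = mtcars)
fit_b <- lm(mpg ~ hp + wt, data = mtcars)

all.equal(coef(fit_a)[["wt"]], coef(fit_b)[["wt"]])  # identical estimates

# Sequential SS differ: each predictor's SS is adjusted only for the
# predictors listed before it.
anova(fit_a)["wt", "Sum Sq"]  # wt unadjusted
anova(fit_b)["wt", "Sum Sq"]  # wt adjusted for hp
```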

```
Rye = read.csv("https://git.psu.edu/stat/essential-r/-/raw/master/Data/Rye-ANCOVA-2008.csv")
head(Rye)
```

```
##   Plant Term Rep RyeDMg Pdate Tdate GrDay Tdate2
## 1    P1   T1   I  443.0   257  20.0 233.0   20.0
## 2    P2   T1   I  287.0   288  20.0 202.0   20.0
## 3    P1   T2   I  631.5   257  32.5 245.5   32.5
## 4    P2   T2   I  513.0   288  32.5 214.5   32.5
## 5    P1   T3   I  771.5   257  39.5 252.5   39.5
## 6    P2   T3   I  570.5   288  39.5 221.5   39.5
```

`summary(lm(RyeDMg ~ Tdate, data = Rye))`

```
##
## Call:
## lm(formula = RyeDMg ~ Tdate, data = Rye)
##
## Residuals:
## Min 1Q Median 3Q Max
## -195.779 -87.402 2.592 82.281 164.221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.232 86.083 1.118 0.276
## Tdate 16.963 2.715 6.249 2.74e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 107.3 on 22 degrees of freedom
## Multiple R-squared: 0.6396, Adjusted R-squared: 0.6232
## F-statistic: 39.04 on 1 and 22 DF, p-value: 2.739e-06
```

`summary(lm(RyeDMg ~ Plant, data = Rye))`

```
##
## Call:
## lm(formula = RyeDMg ~ Plant, data = Rye)
##
## Residuals:
## Min 1Q Median 3Q Max
## -256.63 -115.16 24.06 83.00 332.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 689.25 46.68 14.766 6.73e-13 ***
## PlantP2 -145.63 66.01 -2.206 0.0381 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 161.7 on 22 degrees of freedom
## Multiple R-squared: 0.1811, Adjusted R-squared: 0.1439
## F-statistic: 4.867 on 1 and 22 DF, p-value: 0.03812
```

`summary(lm(RyeDMg ~ Tdate + Plant, data = Rye))`

```
##
## Call:
## lm(formula = RyeDMg ~ Tdate + Plant, data = Rye)
##
## Residuals:
## Min 1Q Median 3Q Max
## -122.967 -65.880 -5.537 58.567 183.033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 169.04 64.12 2.636 0.015429 *
## Tdate 16.96 1.96 8.656 2.27e-08 ***
## PlantP2 -145.62 31.61 -4.607 0.000152 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77.43 on 21 degrees of freedom
## Multiple R-squared: 0.8208, Adjusted R-squared: 0.8037
## F-statistic: 48.08 on 2 and 21 DF, p-value: 1.45e-08
```

```
model1 <- lm(RyeDMg ~ Tdate + Plant + Tdate:Plant, data = Rye)
summary(model1)
```

```
##
## Call:
## lm(formula = RyeDMg ~ Tdate + Plant + Tdate:Plant, data = Rye)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.928 -59.276 -5.537 59.179 169.072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 217.514 88.641 2.454 0.0234 *
## Tdate 15.383 2.795 5.503 2.19e-05 ***
## PlantP2 -242.565 125.358 -1.935 0.0673 .
## Tdate:PlantP2 3.161 3.953 0.800 0.4333
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78.1 on 20 degrees of freedom
## Multiple R-squared: 0.8263, Adjusted R-squared: 0.8003
## F-statistic: 31.71 on 3 and 20 DF, p-value: 8.487e-08
```

```
plot(RyeDMg ~ Tdate, data = Rye, col = as.factor(Rye$Plant))
mcf <- summary(model1)$coef[, 1] #coefficient estimates only
abline(a = mcf[1], b = mcf[2])
abline(a = mcf[1] + mcf[3], b = mcf[2] + mcf[4], lty = 2, col = "red")
library(tidyverse, warn.conflicts = FALSE)
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
```

```
ggplot(Rye, aes(x = Tdate, y = RyeDMg, color = Plant)) +
  geom_point() +
  geom_smooth(method = "lm")
```

``## `geom_smooth()` using formula 'y ~ x'``

*Multicollinearity* arises when there is a strong linear
relationship among the predictor variables (X1, X2, …, Xp).

Effect of multicollinearity on the inference of the regression coefficients:

- Larger changes in \(\hat{\beta_j}\) when another predictor variable is added or deleted.
- Larger changes in \(\hat{\beta_j}\) when another observation is added or deleted.
- Larger s.e.(\(\hat{\beta_j}\)).
- Harder to interpret \(\hat{\beta_j}\) as the effect of \(X_j\) on \(Y\) , because the other \(X_{j'}\) ’s cannot be held constant.
- Estimation/prediction of Y is not a problem if within the data range.
- The sign of \(\hat{\beta_j}\) may be counter-intuitive.
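A small simulation (made-up data) illustrates the standard-error inflation: adding a nearly collinear copy of `x1` greatly increases s.e.(\(\hat{\beta_1}\)).

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # x2 is almost a copy of x1
y  <- 2 * x1 + rnorm(n)

se_alone <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_both  <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
c(alone = se_alone, with_collinear_x2 = se_both)  # se_both is far larger
```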

*Variance Inflation Factor (VIF)*:

For \(j = 1, ..., p\), a variance inflation factor (VIF) for \(\hat{\beta_j}\) is defined as

\[VIF_j=\frac{1}{1-R_j^2}\]

where \(R_j^2\) is the coefficient of multiple determination when \(X_j\) is regressed on the other \(p - 1\) predictor variables \(X_{j'}\).

If the largest of the \(VIF_j\) (j = 1, …, p) exceeds 10, multicollinearity may have a large impact on the inference. If the mean of the \(VIF_j\)'s is considerably greater than 1, there may be serious multicollinearity problems.

VIFs for a fitted model can be computed with `car::vif()`.
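As a sketch, the VIF definition above can also be applied by hand without extra packages, here using the built-in `mtcars` data and an arbitrarily chosen model `mpg ~ wt + hp + disp`:

```r
# VIF for wt from the definition VIF_j = 1 / (1 - R_j^2):
# regress wt on the other predictors and use that R^2.
r2_wt  <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt  # > 1; large values (e.g. > 10) signal strong multicollinearity
```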

*Remedial Measures for Multicollinearity*:

- Multicollinearity is not a modeling error, but rather a condition of the predictor variables.
- One remedial measure is to add data from outside the current data range.
- In practice, however, it may be hard to obtain observations outside the data range.
- Alternatively, one can consider more advanced statistical methods (e.g. principal component regression, ridge regression).
- Drop one or more predictor variables from the model. [Which one(s)?]
- Create new predictor variables (e.g., \(X' = X_2 + X_3\)).

There are many different ways to test different combinations of variables in models (forward selection, backward selection, both directions, …), but all of these methods have been criticized (biased p-values, etc.). In R, you can use the `step()` function to conduct model selection.

`args(step)`

```
## function (object, scope, scale = 0, direction = c("both", "backward",
## "forward"), trace = 1, keep = NULL, steps = 1000, k = 2,
## ...)
## NULL
```
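For example, backward selection by AIC on an arbitrary `mtcars` model (the predictors here are chosen only for illustration):

```r
full   <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)
chosen <- step(full, direction = "backward", trace = 0)  # trace = 0 hides the log
formula(chosen)  # the retained predictors
```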

The most defensible practice is to build a model based on a hypothesis.

General principles: Use simple models. Let science and common sense guide approach. Sometimes there is no single “best” model. There may not be enough information in the data to tell what the truth is exactly.

GLMs are a generalization of linear models and can be used to fit linear regression models, logistic regression models, Poisson regression models, etc.

In R, you can fit all of these model types with the `glm()` function.

`args(glm)`

```
## function (formula, family = gaussian, data, weights, subset,
## na.action, start = NULL, etastart, mustart, offset, control = list(...),
## model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, singular.ok = TRUE,
## contrasts = NULL, ...)
## NULL
```

The following family functions are available in R (shown with their default link functions):

```
binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
quasi(link = "identity", variance = "constant")
quasibinomial(link = "logit")
quasipoisson(link = "log")
```
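For instance, a logistic regression sketch with the built-in `mtcars` data, modeling transmission type (`am`, coded 0/1) as a function of weight:

```r
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
coef(summary(logit_fit))

# Fitted probability of a manual transmission at wt = 3 (i.e. 3000 lb):
predict(logit_fit, newdata = data.frame(wt = 3), type = "response")
```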

Further reading: *A brief introduction to mixed effects modelling and multi-model inference in ecology*.