Statistics is a discipline in which relatively simple models are used to approximately describe “random” phenomena observed in the real world and to make inferences and predictions.
Regression is one of the simplest yet most fundamental statistical tools.
For observations \(i = 1,2, ..., n\),
\[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)\]
Model assumptions (all encoded in the error model above):
- Linearity: \(E(y_i) = \beta_0 + \beta_1 x_i\)
- Independence: the errors \(\varepsilon_i\) are independent
- Constant variance: \(Var(\varepsilon_i) = \sigma^2\) for all \(i\)
- Normality: the errors are normally distributed
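To make the model concrete, here is a minimal simulation sketch (the seed and parameter values are made up for illustration):
set.seed(1)                          # arbitrary seed, for reproducibility
n <- 50
x <- runif(n, 0, 10)                 # arbitrary covariate values
y <- 10 + 1.5 * x + rnorm(n, 0, 2)   # beta0 = 10, beta1 = 1.5, sigma = 2
plot(x, y)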
Model fitting is done by least squares, based on the residual sum of squares (RSS):
\[ RSS = \sum_{i=1}^n \left\{ y_i - \left(\beta_0 + \beta_1 x_i \right)\right\}^2 \]
Find the \(\beta_0\) and \(\beta_1\) that minimize the RSS. For mathematical details, see a textbook on linear models.
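For reference, the minimizers have a standard closed form:
\[\hat{\beta_{1}}=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}},\quad\hat{\beta_{0}}=\bar{y}-\hat{\beta_{1}}\bar{x}\]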
A study was conducted to evaluate the utilization of phosphorus (P) by 9 corn plants. The data consist of x, the inorganic P in soil (ppm), and y, the plant-available P (ppm).
d = data.frame(x = c(1, 4, 5, 9, 13, 11, 23, 23, 28),
               y = c(64, 71, 54, 81, 93, 76, 77, 95, 109))
plot(d$x, d$y, xlab = "Soil inorganic P", ylab = "Plant available P")
cor(d$x, d$y)
## [1] 0.8049892
fit_d = lm(y ~ x, data = d)
summary(fit_d)
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.169 -1.166 1.003 6.668 13.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.5804 6.2477 9.857 2.35e-05 ***
## x 1.4169 0.3947 3.590 0.00886 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 7 degrees of freedom
## Multiple R-squared: 0.648, Adjusted R-squared: 0.5977
## F-statistic: 12.89 on 1 and 7 DF, p-value: 0.008859
# beta 0
coef(fit_d)[1]
## (Intercept)
## 61.58038
# beta 1 Units??
coef(fit_d)["x"]
## x
## 1.416894
So, the fitted linear regression line for this dataset is
\[\hat{Y} = 61.58 + 1.417 X\]
This also answers the units question above: the slope is in ppm of plant-available P per ppm of soil inorganic P, i.e., each additional ppm of soil inorganic P is associated with an estimated 1.417 ppm increase in plant-available P.
Tests of Hypotheses:
Partition the sum of squares (SS)
\[SST=SSR+SSE\] with
\[SST=\sum_{i=1}^{n}(y_{i}-\bar{y})^{2},\quad df=n-1\]
\[SSR=\sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^{2},\quad df=1\]
\[SSE=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}=SST-SSR,\quad df=n-2\]
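As a quick numerical check of this decomposition using fit_d from above:
SST <- sum((d$y - mean(d$y))^2)           # total SS
SSR <- sum((fitted(fit_d) - mean(d$y))^2) # regression SS
SSE <- sum(resid(fit_d)^2)                # error SS
c(SST = SST, SSR = SSR, SSE = SSE, SSR_plus_SSE = SSR + SSE)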
knitr::include_graphics("https://bookdown.org/egarpor/SSS2-UC3M/images/R/anova.png")
ANOVA table:
Source     | df  | SS  | MS  | F
-----------|-----|-----|-----|--------
Regression | 1   | SSR | MSR | MSR/MSE
Error      | n-2 | SSE | MSE |
Total      | n-1 | SST |     |
Distribution under \(H_0 :\beta_1 = 0\),
\[F=\frac{MSR}{MSE}\sim F_{1,n-2}\]
Estimate the error variance \(\sigma^2\) by
\[\hat{\sigma}^{2}=MSE=\frac{SSE}{n-2}\]
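These quantities can also be assembled by hand (reusing SSR and SSE from the check above) before calling anova():
MSR <- SSR / 1               # df = 1
MSE <- SSE / (nrow(d) - 2)   # df = n - 2
F_stat <- MSR / MSE
c(F = F_stat, p_value = pf(F_stat, 1, nrow(d) - 2, lower.tail = FALSE))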
anova(fit_d)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 1473.57 1473.57 12.887 0.008859 **
## Residuals 7 800.43 114.35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Thus, reject \(H_0\) at the 5% level. There is strong evidence of a linear relationship between soil inorganic P and plant-available P.
Distribution under \(H_0 :\beta_1 = 0\),
\[T=\frac{\hat{\beta_{1}}-0}{s.e.(\hat{\beta_{1}})}\sim t_{n-2}\]
\[s.e.(\hat{\beta_{1}})=\frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}},\quad\hat{\sigma}=\sqrt{MSE}\]
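As a sketch, this t statistic can be reproduced by hand from fit_d:
sigma_hat <- sqrt(sum(resid(fit_d)^2) / df.residual(fit_d))  # sqrt(MSE)
se_b1 <- sigma_hat / sqrt(sum((d$x - mean(d$x))^2))
t_stat <- unname(coef(fit_d)["x"]) / se_b1
c(t = t_stat, p_value = 2 * pt(abs(t_stat), df.residual(fit_d), lower.tail = FALSE))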
summary(fit_d)
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.169 -1.166 1.003 6.668 13.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.5804 6.2477 9.857 2.35e-05 ***
## x 1.4169 0.3947 3.590 0.00886 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 7 degrees of freedom
## Multiple R-squared: 0.648, Adjusted R-squared: 0.5977
## F-statistic: 12.89 on 1 and 7 DF, p-value: 0.008859
Thus, reject \(H_0\) at the 5% level. Same conclusion as the F test above.
Note: in simple linear regression only, \(t^{2}=(3.590)^{2}=12.887=F\).
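A quick numerical check of this identity:
t_val <- coef(summary(fit_d))["x", "t value"]
c(t_squared = t_val^2, F = anova(fit_d)["x", "F value"])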
plot(d$x, d$y, xlab = "Soil inorganic P", ylab = "Plant available P")
abline(fit_d)
The (1 − α) × 100% confidence interval for \(\beta_1\) is
\[\hat{\beta_{1}}\pm t_{n-2,\alpha/2}\times s.e.(\hat{\beta_{1}})\]
The (1 − α) × 100% confidence interval for \(\beta_0\) is
\[\hat{\beta_{0}}\pm t_{n-2,\alpha/2}\times s.e.(\hat{\beta_{0}})\]
Exercise: given \(t_{7,0.025}=2.36\), calculate the 95% CIs for \(\beta_0\) and \(\beta_1\).
confint(fit_d)
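For comparison, the intervals can be assembled directly from the formulas above (a check against confint()):
est <- coef(summary(fit_d))[, "Estimate"]
se <- coef(summary(fit_d))[, "Std. Error"]
tc <- qt(0.975, df = df.residual(fit_d))   # t_{7, 0.025}
cbind(lower = est - tc * se, upper = est + tc * se)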
x = summary(fit_d)
x
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.169 -1.166 1.003 6.668 13.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.5804 6.2477 9.857 2.35e-05 ***
## x 1.4169 0.3947 3.590 0.00886 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 7 degrees of freedom
## Multiple R-squared: 0.648, Adjusted R-squared: 0.5977
## F-statistic: 12.89 on 1 and 7 DF, p-value: 0.008859
x$r.squared
## [1] 0.6480077
Estimate (predict) Y at a given \(X=x_{0}\) by \(\hat{y_{0}}=\hat{\beta_{0}}+\hat{\beta_{1}}x_{0}\)
But the standard error depends on the objective: estimating the mean response at \(x_0\) (a confidence interval) is not the same as predicting a single new observation at \(x_0\) (a prediction interval).
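For reference, the corresponding standard errors (standard results, stated here for completeness) are
\[s.e.(\hat{\mu}_{0})=\hat{\sigma}\sqrt{\frac{1}{n}+\frac{(x_{0}-\bar{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}\quad\text{(mean response)}\]
\[s.e.(pred)=\hat{\sigma}\sqrt{1+\frac{1}{n}+\frac{(x_{0}-\bar{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}\quad\text{(new observation)}\]
so a prediction interval is always wider than the corresponding confidence interval, as the output below shows.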
predict(fit_d, newdata = data.frame(x = 18), interval = "prediction")
## fit lwr upr
## 1 87.08447 60.02557 114.1434
predict(fit_d, newdata = data.frame(x = 18), interval = "confidence")
## fit lwr upr
## 1 87.08447 77.45028 96.71865
Correct inference hinges on correct (reasonable) model assumptions. Hence it is important to evaluate the model assumptions, that is, to perform model diagnostics. Our approach is to examine the residuals.
Raw residual = Observed value − Fitted value, i.e., \(e_{i}=y_{i}-\hat{y}_{i}\)
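A quick check in R using fit_d:
res_check <- cbind(observed = d$y, fitted = fitted(fit_d), residual = resid(fit_d))
head(res_check)   # residual = observed - fitted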
A simple and effective method for model diagnostics is the examination of residual plots.
Possible Patterns in Residual Plots:
par(mfrow = c(2, 2))
plot(fit_d)
Cook’s distance \(D_i\) measures the influence of the \(i\)th observation on all \(n\) fitted values. A practical rule: if \(D_i > 1\), investigate the \(i\)th observation as possibly influential.
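In R, Cook’s distances are available via cooks.distance(); for fit_d:
D <- cooks.distance(fit_d)
round(D, 3)
which(D > 1)   # observations to investigate, per the rule above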
When the diagnostics indicate a violation of the model assumptions, we can try transforming the variables so that the assumptions are better satisfied.
library(MASS)
data(mammals)
model1 <- lm(brain ~ body, data = mammals)
summary(model1)
##
## Call:
## lm(formula = brain ~ body, data = mammals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -810.07 -88.52 -79.64 -13.02 2050.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.00440 43.55258 2.09 0.0409 *
## body 0.96650 0.04766 20.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 334.7 on 60 degrees of freedom
## Multiple R-squared: 0.8727, Adjusted R-squared: 0.8705
## F-statistic: 411.2 on 1 and 60 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model1)
m_par = par(mfrow = c(1, 2))
plot(mammals$body, mammals$brain)
plot(log10(mammals$body), log10(mammals$brain))
par(m_par)
model2 <- lm(log(brain) ~ log(body), data = mammals)
summary(model2)
##
## Call:
## lm(formula = log(brain) ~ log(body), data = mammals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.71550 -0.49228 -0.06162 0.43597 1.94829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.13479 0.09604 22.23 <2e-16 ***
## log(body) 0.75169 0.02846 26.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6943 on 60 degrees of freedom
## Multiple R-squared: 0.9208, Adjusted R-squared: 0.9195
## F-statistic: 697.4 on 1 and 60 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model2)