Statistics is the discipline in which relatively simple models are used to approximately describe “random” phenomena observed in the real world, and inference and prediction are then made from those models.
Regression is one of the most basic but fundamental statistical tools.
For observations $i = 1, 2, \ldots, n$,
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
Model assumptions: the errors $\varepsilon_i$ are independent and identically distributed as $N(0, \sigma^2)$; equivalently, the mean of $y$ is linear in $x$, and the errors are independent, have constant variance, and are normally distributed.
Model fitting by least squares: the residual sum of squares (RSS) is
$$\mathrm{RSS} = \sum_{i=1}^{n} \{y_i - (\beta_0 + \beta_1 x_i)\}^2$$
Find the $\beta_0$ and $\beta_1$ that minimize the RSS. For the mathematical details, see a book on linear models.
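For illustration, here is a minimal sketch of the closed-form least-squares estimates (the x and y vectors below are made-up illustrative data, not the corn example):

# Closed-form least-squares estimates, compared against lm()
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(b0, b1)
coef(lm(y ~ x))  # should agree with the closed-form values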
A study was conducted to evaluate the utilization of phosphorus (P) by 9 corn plants. The data consist of x, the inorganic P in soil (ppm), and y, the plant-available P (ppm).
d = data.frame(x = c(1, 4, 5, 9, 13, 11, 23, 23, 28),
y = c(64, 71, 54, 81, 93, 76, 77, 95, 109))
plot(d$x, d$y, xlab = "Soil inorganic P", ylab = "Plant available P")
cor(d$x, d$y)
## [1] 0.8049892
fit_d = lm(y ~ x, data = d)
summary(fit_d)
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.169 -1.166 1.003 6.668 13.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.5804 6.2477 9.857 2.35e-05 ***
## x 1.4169 0.3947 3.590 0.00886 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 7 degrees of freedom
## Multiple R-squared: 0.648, Adjusted R-squared: 0.5977
## F-statistic: 12.89 on 1 and 7 DF, p-value: 0.008859
# beta 0
coef(fit_d)[1]
## (Intercept)
## 61.58038
# beta 1 -- note its units: ppm plant-available P per ppm soil inorganic P
coef(fit_d)["x"]
## x
## 1.416894
So, the fitted linear regression line for this dataset is
$$\hat{Y} = 61.58 + 1.417X$$
Tests of Hypotheses:
Partition the total sum of squares (SS):
$$\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}$$
$$\mathrm{SST} = \sum_{i=1}^{n}(y_i - \bar{y})^2, \quad df = n - 1$$
$$\mathrm{SSR} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2, \quad df = 1; \qquad \mathrm{SSE} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \quad df = n - 2$$
knitr::include_graphics("https://bookdown.org/egarpor/SSS2-UC3M/images/R/anova.png")
ANOVA table:
Source | df | SS | MS | F |
---|---|---|---|---|
Regression | 1 | SSR | MSR | MSR/MSE |
Error | n-2 | SSE | MSE | |
Total | n-1 | SST | | |
Distribution under $H_0: \beta_1 = 0$:
$$F = \frac{\mathrm{MSR}}{\mathrm{MSE}} \sim F_{1,\, n-2}$$
$$\hat{\sigma}^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n-2}$$
anova(fit_d)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 1473.57 1473.57 12.887 0.008859 **
## Residuals 7 800.43 114.35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
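As a check, here is a minimal sketch that reproduces the sum-of-squares decomposition and the F statistic by hand from fit_d:

# Reproduce the SS decomposition and F statistic directly
yhat <- fitted(fit_d)
SST <- sum((d$y - mean(d$y))^2)
SSR <- sum((yhat - mean(d$y))^2)
SSE <- sum((d$y - yhat)^2)
F_stat <- (SSR / 1) / (SSE / (nrow(d) - 2))
c(SSR = SSR, SSE = SSE, F = F_stat)  # should match the anova() table above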
Since the p-value (0.0089) is below 0.05, reject H0 at the 5% level. There is strong evidence of a relationship between soil inorganic P and plant-available P.
Distribution under $H_0: \beta_1 = 0$:
$$T = \frac{\hat{\beta}_1 - 0}{\mathrm{s.e.}(\hat{\beta}_1)} \sim t_{n-2}$$
$$\mathrm{s.e.}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}, \quad \hat{\sigma} = \sqrt{\mathrm{MSE}}$$
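A minimal sketch that reproduces this standard error and t statistic by hand:

# Standard error of the slope and the t statistic, computed directly
sigma_hat <- sqrt(sum(residuals(fit_d)^2) / (nrow(d) - 2))
se_b1 <- sigma_hat / sqrt(sum((d$x - mean(d$x))^2))
t_stat <- coef(fit_d)[["x"]] / se_b1
c(se = se_b1, t = t_stat)  # should match the coefficient table in summary() below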
summary(fit_d)
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.169 -1.166 1.003 6.668 13.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.5804 6.2477 9.857 2.35e-05 ***
## x 1.4169 0.3947 3.590 0.00886 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 7 degrees of freedom
## Multiple R-squared: 0.648, Adjusted R-squared: 0.5977
## F-statistic: 12.89 on 1 and 7 DF, p-value: 0.008859
Thus, reject H0 at the 5% level. Same conclusion as the F test above.
Note: in simple linear regression only, $T^2 = (3.590)^2 \approx 12.89 = F$.
plot(d$x, d$y, xlab = "Soil inorganic P", ylab = "Plant available P")
abline(fit_d)
The $(1-\alpha) \times 100\%$ confidence interval for $\beta_1$ is
$$\hat{\beta}_1 \pm t_{n-2,\, \alpha/2} \times \mathrm{s.e.}(\hat{\beta}_1)$$
The $(1-\alpha) \times 100\%$ confidence interval for $\beta_0$ is
$$\hat{\beta}_0 \pm t_{n-2,\, \alpha/2} \times \mathrm{s.e.}(\hat{\beta}_0)$$
Exercise: given $t_{7,\, 0.025} = 2.36$, calculate the 95% CIs for $\beta_0$ and $\beta_1$.
confint(fit_d)
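A minimal sketch of the hand calculation for the exercise, which should agree with confint() up to rounding:

# 95% CIs built directly from the estimates and standard errors
est <- coef(summary(fit_d))[, "Estimate"]
se  <- coef(summary(fit_d))[, "Std. Error"]
t_crit <- qt(0.975, df = 7)   # = 2.36 to three significant figures
cbind(lower = est - t_crit * se, upper = est + t_crit * se)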
x = summary(fit_d)
x
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.169 -1.166 1.003 6.668 13.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.5804 6.2477 9.857 2.35e-05 ***
## x 1.4169 0.3947 3.590 0.00886 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 7 degrees of freedom
## Multiple R-squared: 0.648, Adjusted R-squared: 0.5977
## F-statistic: 12.89 on 1 and 7 DF, p-value: 0.008859
x$r.squared
## [1] 0.6480077
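Other components stored in the summary object can be extracted the same way, for example:

x$sigma          # residual standard error (10.69 here)
x$coefficients   # the coefficient table as a matrix
x$fstatistic     # F statistic with its degrees of freedom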
Estimate (predict) $Y$ at a given $X = x_0$ by $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$.
But the standard error of this value depends on the objective: estimating the mean response at $x_0$ (a confidence interval) versus predicting a single new observation at $x_0$ (a prediction interval).
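For reference, the standard formulas for the two standard errors are
$$\mathrm{s.e.}_{\text{mean}}(\hat{y}_0) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}, \qquad \mathrm{s.e.}_{\text{pred}}(\hat{y}_0) = \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
The extra 1 under the square root is why the prediction interval below is wider than the confidence interval.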
predict(fit_d, newdata = data.frame(x = 18), interval = "prediction")
## fit lwr upr
## 1 87.08447 60.02557 114.1434
predict(fit_d, newdata = data.frame(x = 18), interval = "confidence")
## fit lwr upr
## 1 87.08447 77.45028 96.71865
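To visualize the difference, here is a minimal sketch that draws both bands over a grid of x values (the grid range is an arbitrary choice covering the observed data):

new_x <- data.frame(x = seq(min(d$x), max(d$x), length.out = 100))
ci <- predict(fit_d, newdata = new_x, interval = "confidence")
pi <- predict(fit_d, newdata = new_x, interval = "prediction")
plot(d$x, d$y, xlab = "Soil inorganic P", ylab = "Plant available P")
abline(fit_d)
lines(new_x$x, ci[, "lwr"], lty = 2)  # confidence band (mean response)
lines(new_x$x, ci[, "upr"], lty = 2)
lines(new_x$x, pi[, "lwr"], lty = 3)  # prediction band (new observation)
lines(new_x$x, pi[, "upr"], lty = 3)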
Correct inference hinges on correct (reasonable) model assumptions. Hence it is important to evaluate the model assumptions, that is, to perform model diagnostics. Our approach is to examine the residuals.
Raw residual = Observed value − Fitted value
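In R, the raw residuals can be computed from this definition or extracted directly:

res <- d$y - fitted(fit_d)                         # observed minus fitted
all.equal(unname(res), unname(residuals(fit_d)))   # should be TRUE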
A simple and effective method for model diagnostics is the examination of residual plots.
Possible Patterns in Residual Plots:
par(mfrow = c(2, 2))
plot(fit_d)
Cook’s distance $D_i$ measures the influence of the $i$-th observation on all $n$ fitted values. A practical rule: if $D_i > 1$, investigate the $i$-th observation as possibly influential.
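A minimal sketch of this check for the corn data fit:

cd <- cooks.distance(fit_d)   # Cook's distance for each observation
round(cd, 3)
which(cd > 1)                 # observations to investigate by the D_i > 1 rule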
When the model diagnostics indicate a violation of the model assumptions, we can try transforming the variables so that the assumptions are better satisfied.
library(MASS)
data(mammals)
model1 <- lm(brain ~ body, data = mammals)
summary(model1)
##
## Call:
## lm(formula = brain ~ body, data = mammals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -810.07 -88.52 -79.64 -13.02 2050.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.00440 43.55258 2.09 0.0409 *
## body 0.96650 0.04766 20.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 334.7 on 60 degrees of freedom
## Multiple R-squared: 0.8727, Adjusted R-squared: 0.8705
## F-statistic: 411.2 on 1 and 60 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model1)
m_par = par(mfrow = c(1, 2))
plot(mammals$body, mammals$brain)
plot(log10(mammals$body), log10(mammals$brain))
par(m_par)
model2 <- lm(log(brain) ~ log(body), data = mammals)
summary(model2)
##
## Call:
## lm(formula = log(brain) ~ log(body), data = mammals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.71550 -0.49228 -0.06162 0.43597 1.94829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.13479 0.09604 22.23 <2e-16 ***
## log(body) 0.75169 0.02846 26.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6943 on 60 degrees of freedom
## Multiple R-squared: 0.9208, Adjusted R-squared: 0.9195
## F-statistic: 697.4 on 1 and 60 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model2)