Statistics is a discipline in which relatively simple models are used to approximately describe “random” phenomena observed in the real world and to draw inferences and make predictions.

Basic

Regression is one of the most basic but fundamental statistical tools.

For observations \(i = 1,2, ..., n\),

\[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,\] where \[\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2).\]

Model assumptions: the mean of \(y\) is linear in \(x\); the errors \(\varepsilon_i\) are independent, have equal variance \(\sigma^2\), and are normally distributed.

Model fitting by least squares: minimize the residual sum of squares (RSS)

\[ RSS = \sum_{i=1}^n \left\{ y_i - \left(\beta_0 + \beta_1 x_i \right)\right\}^2 \]

Find the \(\beta_0\) and \(\beta_1\) that minimize the RSS. For mathematical details, read a book on linear models.
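To make this concrete, here is a minimal sketch (the “true” values \(\beta_0 = 60\), \(\beta_1 = 1.5\), and \(\sigma = 10\) below are made up for illustration): simulate data from the model, compute the least-squares estimates from the standard closed-form formulas \(\hat{\beta_{1}}=\sum_{i}(x_{i}-\bar{x})(y_{i}-\bar{y})/\sum_{i}(x_{i}-\bar{x})^{2}\) and \(\hat{\beta_{0}}=\bar{y}-\hat{\beta_{1}}\bar{x}\), then compare them with lm().

set.seed(1)                                      # for reproducibility
n_sim <- 30
x_sim <- runif(n_sim, 0, 30)                     # made-up predictor values
beta0 <- 60; beta1 <- 1.5; sigma <- 10           # assumed "true" values, illustrative only
y_sim <- beta0 + beta1 * x_sim + rnorm(n_sim, 0, sigma)   # y_i = beta0 + beta1*x_i + eps_i
# closed-form least-squares estimates
b1_hat <- sum((x_sim - mean(x_sim)) * (y_sim - mean(y_sim))) / sum((x_sim - mean(x_sim))^2)
b0_hat <- mean(y_sim) - b1_hat * mean(x_sim)
c(b0_hat, b1_hat)
coef(lm(y_sim ~ x_sim))                          # should agree with the values above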

Example

A study was conducted to evaluate the utilization of phosphorus (P) by 9 corn plants. The data consist of x, the inorganic P in soil (ppm), and y, the plant-available P (ppm).

d = data.frame(x = c(1, 4, 5, 9, 13, 11, 23, 23, 28),
               y = c(64, 71, 54, 81, 93, 76, 77, 95, 109))
plot(d$x, d$y, xlab = "Soil inorganic P", ylab = "Plant available P")

cor(d$x, d$y)
## [1] 0.8049892
fit_d = lm(y ~ x, data = d)
summary(fit_d)
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.169  -1.166   1.003   6.668  13.000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  61.5804     6.2477   9.857 2.35e-05 ***
## x             1.4169     0.3947   3.590  0.00886 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.69 on 7 degrees of freedom
## Multiple R-squared:  0.648,  Adjusted R-squared:  0.5977 
## F-statistic: 12.89 on 1 and 7 DF,  p-value: 0.008859
# beta 0
coef(fit_d)[1]
## (Intercept) 
##    61.58038
# beta 1  Units??
coef(fit_d)["x"]
##        x 
## 1.416894

So, the fitted regression line for this dataset is

\[\hat{Y} = 61.58 + 1.417 X,\]

where the slope 1.417 is in ppm of plant-available P per ppm of soil inorganic P.

Tests of Hypotheses:

ANOVA

Partition the sum of squares (SS)

\[SST=SSR+SSE\] with

\[SST=\sum_{i=1}^{n}(y_{i}-\bar{y})^{2},\quad df=n-1\]
\[SSR=\sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^{2},\quad df=1\]
\[SSE=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}=SST-SSR,\quad df=n-2\]

knitr::include_graphics("https://bookdown.org/egarpor/SSS2-UC3M/images/R/anova.png")

ANOVA table:

Source       df     SS     MS     F
Regression    1    SSR    MSR    MSR/MSE
Error       n-2    SSE    MSE
Total       n-1    SST

Distribution under \(H_0 :\beta_1 = 0\),

\[F=\frac{MSR}{MSE}\sim F_{1,n-2}\]

Estimate the error variance \(\sigma^2\):

\[\hat{\sigma}^{2}=MSE=\frac{SSE}{n-2}\]
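As a sketch, these quantities can be computed directly from the fitted model and compared with the anova() table below:

y_hat <- fitted(fit_d)                   # fitted values
SST <- sum((d$y - mean(d$y))^2)          # total SS, df = n - 1
SSR <- sum((y_hat - mean(d$y))^2)        # regression SS, df = 1
SSE <- sum((d$y - y_hat)^2)              # error SS, df = n - 2
c(SST = SST, SSR = SSR, SSE = SSE, MSE = SSE / (nrow(d) - 2))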

anova(fit_d)
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value   Pr(>F)   
## x          1 1473.57 1473.57  12.887 0.008859 **
## Residuals  7  800.43  114.35                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Thus, reject \(H_0\) at the 5% level. There is strong evidence of a relationship between soil inorganic P and plant-available P.

T-test for \(H_0 :\beta_1 = 0\)

Distribution under \(H_0 :\beta_1 = 0\),

\[T=\frac{\hat{\beta_{1}}-0}{s.e.(\hat{\beta_{1}})}\sim t_{n-2}\]

\[s.e.(\hat{\beta_{1}})=\frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}},\quad \hat{\sigma}=\sqrt{MSE}\]
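A sketch of this calculation for the phosphorus model, to compare with the summary() output below:

sigma_hat <- sqrt(sum(resid(fit_d)^2) / (nrow(d) - 2))    # sqrt(MSE)
se_b1     <- sigma_hat / sqrt(sum((d$x - mean(d$x))^2))   # s.e. of beta1-hat
t_stat    <- coef(fit_d)["x"] / se_b1                     # t statistic
p_value   <- 2 * pt(-abs(t_stat), df = nrow(d) - 2)       # two-sided p-value
c(se = se_b1, t = t_stat, p = p_value)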

summary(fit_d)
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.169  -1.166   1.003   6.668  13.000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  61.5804     6.2477   9.857 2.35e-05 ***
## x             1.4169     0.3947   3.590  0.00886 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.69 on 7 degrees of freedom
## Multiple R-squared:  0.648,  Adjusted R-squared:  0.5977 
## F-statistic: 12.89 on 1 and 7 DF,  p-value: 0.008859

Thus, reject \(H_0\) at the 5% level. Same conclusion as the F test above.

Note: in simple linear regression only, \(t^{2}\) equals the \(F\) statistic: \((3.590)^{2}\approx 12.89 = F\).
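A quick sketch checking this identity from the fitted model:

t_val <- summary(fit_d)$coefficients["x", "t value"]
f_val <- anova(fit_d)["x", "F value"]
c(t_squared = t_val^2, F = f_val)     # should be (essentially) equal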

plot(d$x, d$y, xlab = "Soil inorganic P", ylab = "Plant available P")
abline(fit_d)

Confidence intervals for \(\beta_0\) and \(\beta_1\)

The (1 − α) × 100% confidence interval for \(\beta_1\) is

\[\hat{\beta_{1}}\pm t_{n-2,\alpha/2}\times s.e.(\hat{\beta_{1}})\]

The (1 − α) × 100% confidence interval for \(\beta_0\) is

\[\hat{\beta_{0}}\pm t_{n-2,\alpha/2}\times s.e.(\hat{\beta_{0}})\]

Exercise: given \(t_{7,0.025}=2.36\), calculate the 95% CIs for \(\beta_0\) and \(\beta_1\).

confint(fit_d)
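For comparison, a sketch of the same calculation done by hand, using qt() for the exact critical value rather than the rounded 2.36:

est   <- coef(fit_d)                           # beta0-hat and beta1-hat
se    <- coef(summary(fit_d))[, "Std. Error"]  # their standard errors
tcrit <- qt(0.975, df = nrow(d) - 2)           # t_{7, 0.025}
cbind(lower = est - tcrit * se, upper = est + tcrit * se)   # should match confint(fit_d)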

Measuring the quality of fit

  1. Strength of the linear relationship: F test or T test.
  2. Sample correlation between Y and X.
  3. Sample correlation between Y and \(\hat{Y}\).
  4. The coefficient of determination: \(R^{2}=\frac{SSR}{SST}\)
x = summary(fit_d)
x
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.169  -1.166   1.003   6.668  13.000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  61.5804     6.2477   9.857 2.35e-05 ***
## x             1.4169     0.3947   3.590  0.00886 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.69 on 7 degrees of freedom
## Multiple R-squared:  0.648,  Adjusted R-squared:  0.5977 
## F-statistic: 12.89 on 1 and 7 DF,  p-value: 0.008859
x$r.squared
## [1] 0.6480077
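A quick check (sketch): in simple linear regression, \(R^{2}\) equals the squared sample correlation between \(y\) and \(\hat{y}\), which in turn equals the squared sample correlation between \(y\) and \(x\).

cor(d$y, fitted(fit_d))^2   # squared correlation between y and y-hat
cor(d$y, d$x)^2             # squared correlation between y and x; both equal R-squared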

Prediction of Y

Estimate (predict) Y at a given \(X=x_{0}\) by \(\hat{y_{0}}=\hat{\beta_{0}}+\hat{\beta_{1}}x_{0}\)

But the standard error of the prediction depends on the objective.

  • \(\hat{y_{0}}\) is a predictor of a new/future observation at \(x_{0}\) (prediction interval)
  • \(\hat{y_{0}}\) is an estimator of the expected value \(y_{0}=E(Y|X=x_{0})=\beta_{0}+\beta_{1}x_{0}\) (confidence interval; denote \(\hat{y_{0}}\) as \(\hat{\mu_{0}}\))
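The two standard errors are given by standard linear-model results (both intervals use the \(t_{n-2,\alpha/2}\) critical value):

\[s.e.(\hat{\mu_{0}})=\hat{\sigma}\sqrt{\frac{1}{n}+\frac{(x_{0}-\bar{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}},\qquad s.e.(\text{prediction at }x_{0})=\hat{\sigma}\sqrt{1+\frac{1}{n}+\frac{(x_{0}-\bar{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}\]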
predict(fit_d, newdata = data.frame(x = 18), interval = "prediction")
##        fit      lwr      upr
## 1 87.08447 60.02557 114.1434
predict(fit_d, newdata = data.frame(x = 18), interval = "confidence")
##        fit      lwr      upr
## 1 87.08447 77.45028 96.71865
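A sketch verifying these two intervals by hand from the standard errors above:

x0      <- 18
n       <- nrow(d)
Sxx     <- sum((d$x - mean(d$x))^2)
mu0_hat <- sum(coef(fit_d) * c(1, x0))                            # point prediction at x0
sigma_hat <- summary(fit_d)$sigma                                 # sqrt(MSE)
se_mean <- sigma_hat * sqrt(1/n + (x0 - mean(d$x))^2 / Sxx)       # s.e. for E(Y | X = x0)
se_pred <- sigma_hat * sqrt(1 + 1/n + (x0 - mean(d$x))^2 / Sxx)   # s.e. for a new observation
tcrit   <- qt(0.975, df = n - 2)
mu0_hat + c(-1, 1) * tcrit * se_mean   # confidence interval
mu0_hat + c(-1, 1) * tcrit * se_pred   # prediction interval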

Model Diagnostics

Correct inference hinges on correct (reasonable) model assumptions. Hence it is important to evaluate the model assumptions, that is, to perform model diagnostics. Our approach is to examine the residuals.

Raw residual = observed value − fitted value, i.e., \(e_i = y_i - \hat{y}_i\).

A simple and effective method for model diagnostics is the examination of residual plots.

  • Normal probability plot: the ordered residuals plotted against normal scores. Checks the normality assumption (the points should fall along a nearly straight line).
  • Scatter plot of the residuals (\(e_i\)) against the predictor (\(x_i\)). Checks the linearity and equal-variance assumptions (should show random scatter).
  • Scatter plot of the residuals (\(e_i\)) against the fitted values (\(\hat{y}_i\)). Checks the linearity and equal-variance assumptions (should show random scatter).
  • Index plot: the residuals plotted against the observation number. Checks the independence assumption (e.g., over time or space).
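Each of these plots can be produced directly from the fitted model object; a minimal sketch:

e <- resid(fit_d)                                                     # raw residuals
par(mfrow = c(2, 2))
qqnorm(e); qqline(e)                                                  # normal probability plot
plot(d$x, e, xlab = "Soil inorganic P", ylab = "Residual")            # residuals vs predictor
plot(fitted(fit_d), e, xlab = "Fitted value", ylab = "Residual")      # residuals vs fitted values
plot(e, type = "b", xlab = "Observation number", ylab = "Residual")   # index plot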

Possible Patterns in Residual Plots:

  • Curvature: non-linearity.
  • Fan shape: unequal variances.
  • Outliers: extremely large residuals.

Some general remarks:

  • There are no golden rules or magic formulas.
  • Decisions may be difficult to make for small sample sizes. However, small/minor violations of the model assumptions do not invalidate conclusions in a major way.
  • Gross violations of the model assumptions can seriously alter conclusions.
  • Model diagnostics help achieve a thorough and informative analysis of the data.
  • Some formal testing procedures are available, such as a formal test for outliers.
par(mfrow = c(2, 2))
plot(fit_d)

Cook’s distance measures the influence of the \(i\)-th observation on all \(n\) fitted values. A practical rule: if \(D_i > 1\), investigate the \(i\)-th observation as possibly influential.
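A sketch of computing Cook’s distances for the phosphorus model and applying this rule:

D <- cooks.distance(fit_d)   # one distance per observation
round(D, 3)
which(D > 1)                 # observations flagged as possibly influential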

Transformation of data

When the model diagnostics indicate a violation of the model assumptions, we may try transforming the variables so that the assumptions are (approximately) met.

library(MASS) 
data(mammals)
model1 <- lm(brain ~ body, data = mammals) 
summary(model1)
## 
## Call:
## lm(formula = brain ~ body, data = mammals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -810.07  -88.52  -79.64  -13.02 2050.33 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 91.00440   43.55258    2.09   0.0409 *  
## body         0.96650    0.04766   20.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 334.7 on 60 degrees of freedom
## Multiple R-squared:  0.8727, Adjusted R-squared:  0.8705 
## F-statistic: 411.2 on 1 and 60 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model1)

m_par = par(mfrow = c(1, 2))
plot(mammals$body, mammals$brain)
plot(log10(mammals$body), log10(mammals$brain))

par(m_par)
model2 <- lm(log(brain) ~ log(body), data = mammals) 
summary(model2)
## 
## Call:
## lm(formula = log(brain) ~ log(body), data = mammals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.71550 -0.49228 -0.06162  0.43597  1.94829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.13479    0.09604   22.23   <2e-16 ***
## log(body)    0.75169    0.02846   26.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6943 on 60 degrees of freedom
## Multiple R-squared:  0.9208, Adjusted R-squared:  0.9195 
## F-statistic: 697.4 on 1 and 60 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model2)