Statistics is a discipline in which relatively simple models are used to approximately describe “random” phenomena observed in the real world and to make inferences and predictions.
Regression is one of the simplest yet most fundamental statistical tools.
For observations \(i = 1,2, ..., n\),
\[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)\]
Model assumptions (all encoded in the error model above):
- Linearity: \(E(y_i) = \beta_0 + \beta_1 x_i\)
- Independence: the errors \(\varepsilon_i\) are independent
- Constant variance: \(Var(\varepsilon_i) = \sigma^2\) for all \(i\)
- Normality: the errors are normally distributed
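To make the model concrete, here is a minimal simulation sketch (the seed and parameter values are made up for illustration):
set.seed(1)                          # arbitrary seed, for reproducibility
n <- 50
x <- runif(n, 0, 10)                 # arbitrary covariate values
y <- 10 + 1.5 * x + rnorm(n, 0, 2)   # beta0 = 10, beta1 = 1.5, sigma = 2
plot(x, y)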
Model fitting is done by least squares, based on the residual sum of squares (RSS):
\[ RSS = \sum_{i=1}^n \left\{ y_i - \left(\beta_0 + \beta_1 x_i \right)\right\}^2 \]
Find the \(\beta_0\) and \(\beta_1\) that minimize the RSS. For mathematical details, see a textbook on linear models.
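For reference, the minimizers have a standard closed form:
\[\hat{\beta_{1}}=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}},\quad\hat{\beta_{0}}=\bar{y}-\hat{\beta_{1}}\bar{x}\]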
A study was conducted to evaluate the utilization of phosphorus (P) by 9 corn plants. The data consist of x, the inorganic P in soil (ppm), and y, the plant-available P (ppm).
d = data.frame(x = c(1, 4, 5, 9, 13, 11, 23, 23, 28),
               y = c(64, 71, 54, 81, 93, 76, 77, 95, 109))
plot(d$x, d$y, xlab = "Soil inorganic P", ylab = "Plant available P")
cor(d$x, d$y)
## [1] 0.8049892
fit_d = lm(y ~ x, data = d)
summary(fit_d)
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.169 -1.166 1.003 6.668 13.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.5804 6.2477 9.857 2.35e-05 ***
## x 1.4169 0.3947 3.590 0.00886 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 7 degrees of freedom
## Multiple R-squared: 0.648, Adjusted R-squared: 0.5977
## F-statistic: 12.89 on 1 and 7 DF, p-value: 0.008859
# beta 0
coef(fit_d)[1]
## (Intercept)
## 61.58038
# beta 1 Units??
coef(fit_d)["x"]
## x
## 1.416894
So, the fitted linear regression line for this dataset is
\[\hat{Y} = 61.58 + 1.417 X\]
This also answers the units question above: the slope is in ppm of plant-available P per ppm of soil inorganic P, i.e., each additional ppm of soil inorganic P is associated with an estimated 1.417 ppm increase in plant-available P.
Tests of Hypotheses:
Partition the sum of squares (SS)
\[SST=SSR+SSE\] with
\[SST=\sum_{i=1}^{n}(y_{i}-\bar{y})^{2},\quad df=n-1\]
\[SSR=\sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^{2},\quad df=1\]
\[SSE=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}=SST-SSR,\quad df=n-2\]
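As a quick numerical check of this decomposition using fit_d from above:
SST <- sum((d$y - mean(d$y))^2)           # total SS
SSR <- sum((fitted(fit_d) - mean(d$y))^2) # regression SS
SSE <- sum(resid(fit_d)^2)                # error SS
c(SST = SST, SSR = SSR, SSE = SSE, SSR_plus_SSE = SSR + SSE)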
knitr::include_graphics("https://bookdown.org/egarpor/SSS2-UC3M/images/R/anova.png")
ANOVA table:
Source     | df  | SS  | MS  | F
-----------|-----|-----|-----|--------
Regression | 1   | SSR | MSR | MSR/MSE
Error      | n-2 | SSE | MSE |
Total      | n-1 | SST |     |
Distribution under \(H_0 :\beta_1 = 0\),
\[F=\frac{MSR}{MSE}\sim F_{1,n-2}\]
Estimate the error variance \(\sigma^2\) by
\[\hat{\sigma}^{2}=MSE=\frac{SSE}{n-2}\]
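These quantities can also be assembled by hand (reusing SSR and SSE from the check above) before calling anova():
MSR <- SSR / 1               # df = 1
MSE <- SSE / (nrow(d) - 2)   # df = n - 2
F_stat <- MSR / MSE
c(F = F_stat, p_value = pf(F_stat, 1, nrow(d) - 2, lower.tail = FALSE))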
anova(fit_d)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 1473.57 1473.57 12.887 0.008859 **
## Residuals 7 800.43 114.35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Thus, reject \(H_0\) at the 5% level. There is strong evidence of a linear relationship between soil inorganic P and plant-available P.
Distribution under \(H_0 :\beta_1 = 0\),
\[T=\frac{\hat{\beta_{1}}-0}{s.e.(\hat{\beta_{1}})}\sim t_{n-2}\]
\[s.e.(\hat{\beta_{1}})=\frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}},\quad\hat{\sigma}=\sqrt{MSE}\]
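As a sketch, this t statistic can be reproduced by hand from fit_d:
sigma_hat <- sqrt(sum(resid(fit_d)^2) / df.residual(fit_d))  # sqrt(MSE)
se_b1 <- sigma_hat / sqrt(sum((d$x - mean(d$x))^2))
t_stat <- unname(coef(fit_d)["x"]) / se_b1
c(t = t_stat, p_value = 2 * pt(abs(t_stat), df.residual(fit_d), lower.tail = FALSE))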
summary(fit_d)
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.169 -1.166 1.003 6.668 13.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.5804 6.2477 9.857 2.35e-05 ***
## x 1.4169 0.3947 3.590 0.00886 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 7 degrees of freedom
## Multiple R-squared: 0.648, Adjusted R-squared: 0.5977
## F-statistic: 12.89 on 1 and 7 DF, p-value: 0.008859
Thus, reject \(H_0\) at the 5% level. Same conclusion as the F test above.
Note: in simple linear regression only, \(t^{2}=(3.590)^{2}=12.887=F\).
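A quick numerical check of this identity:
t_val <- coef(summary(fit_d))["x", "t value"]
c(t_squared = t_val^2, F = anova(fit_d)["x", "F value"])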
plot(d$x, d$y, xlab = "Soil inorganic P", ylab = "Plant available P")
abline(fit_d)
The (1 − α) × 100% confidence interval for \(\beta_1\) is
\[\hat{\beta_{1}}\pm t_{n-2,\alpha/2}\times s.e.(\hat{\beta_{1}})\]
The (1 − α) × 100% confidence interval for \(\beta_0\) is
\[\hat{\beta_{0}}\pm t_{n-2,\alpha/2}\times s.e.(\hat{\beta_{0}})\]
Exercise: given \(t_{7,0.025}=2.36\), calculate the 95% CIs for \(\beta_0\) and \(\beta_1\).
confint(fit_d)
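For comparison, the intervals can be assembled directly from the formulas above (a check against confint()):
est <- coef(summary(fit_d))[, "Estimate"]
se <- coef(summary(fit_d))[, "Std. Error"]
tc <- qt(0.975, df = df.residual(fit_d))   # t_{7, 0.025}
cbind(lower = est - tc * se, upper = est + tc * se)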
x = summary(fit_d)
x
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.169 -1.166 1.003 6.668 13.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.5804 6.2477 9.857 2.35e-05 ***
## x 1.4169 0.3947 3.590 0.00886 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 7 degrees of freedom
## Multiple R-squared: 0.648, Adjusted R-squared: 0.5977
## F-statistic: 12.89 on 1 and 7 DF, p-value: 0.008859
x$r.squared
## [1] 0.6480077
Estimate (predict) Y at a given \(X=x_{0}\) by \(\hat{y_{0}}=\hat{\beta_{0}}+\hat{\beta_{1}}x_{0}\)
But the standard error depends on the objective: estimating the mean response at \(x_0\) (a confidence interval) is not the same as predicting a single new observation at \(x_0\) (a prediction interval).
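For reference, the corresponding standard errors (standard results, stated here for completeness) are
\[s.e.(\hat{\mu}_{0})=\hat{\sigma}\sqrt{\frac{1}{n}+\frac{(x_{0}-\bar{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}\quad\text{(mean response)}\]
\[s.e.(pred)=\hat{\sigma}\sqrt{1+\frac{1}{n}+\frac{(x_{0}-\bar{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}\quad\text{(new observation)}\]
so a prediction interval is always wider than the corresponding confidence interval, as the output below shows.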
predict(fit_d, newdata = data.frame(x = 18), interval = "prediction")
## fit lwr upr
## 1 87.08447 60.02557 114.1434
predict(fit_d, newdata = data.frame(x = 18), interval = "confidence")
## fit lwr upr
## 1 87.08447 77.45028 96.71865
Correct inference hinges on correct (reasonable) model assumptions. Hence it is important to evaluate the model assumptions, that is, to perform model diagnostics. Our approach is to examine the residuals.
Raw residual = Observed value − Fitted value, i.e., \(e_{i}=y_{i}-\hat{y}_{i}\)
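A quick check in R using fit_d:
res_check <- cbind(observed = d$y, fitted = fitted(fit_d), residual = resid(fit_d))
head(res_check)   # residual = observed - fitted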
A simple and effective method for model diagnostics is the examination of residual plots.
Possible Patterns in Residual Plots:
par(mfrow = c(2, 2))
plot(fit_d)
Cook’s distance \(D_i\) measures the influence of the \(i\)th observation on all \(n\) fitted values. A practical rule: if \(D_i > 1\), investigate the \(i\)th observation as possibly influential.
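In R, Cook’s distances are available via cooks.distance(); for fit_d:
D <- cooks.distance(fit_d)
round(D, 3)
which(D > 1)   # observations to investigate, per the rule above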
When the diagnostics indicate a violation of the model assumptions, we can try transforming the variables so that the assumptions are better satisfied.
library(MASS)
data(mammals)
model1 <- lm(brain ~ body, data = mammals)
summary(model1)
##
## Call:
## lm(formula = brain ~ body, data = mammals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -810.07 -88.52 -79.64 -13.02 2050.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.00440 43.55258 2.09 0.0409 *
## body 0.96650 0.04766 20.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 334.7 on 60 degrees of freedom
## Multiple R-squared: 0.8727, Adjusted R-squared: 0.8705
## F-statistic: 411.2 on 1 and 60 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model1)
m_par = par(mfrow = c(1, 2))
plot(mammals$body, mammals$brain)
plot(log10(mammals$body), log10(mammals$brain))
par(m_par)
model2 <- lm(log(brain) ~ log(body), data = mammals)
summary(model2)
##
## Call:
## lm(formula = log(brain) ~ log(body), data = mammals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.71550 -0.49228 -0.06162 0.43597 1.94829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.13479 0.09604 22.23 <2e-16 ***
## log(body) 0.75169 0.02846 26.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6943 on 60 degrees of freedom
## Multiple R-squared: 0.9208, Adjusted R-squared: 0.9195
## F-statistic: 697.4 on 1 and 60 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model2)