STAT0023 Week 2
Linear regression and ANOVA in R
Ioanna Manolopoulou and Richard Chandler
Regression
Regression models
Recap: regression basics (STAT0006 or equivalent)
Data: {Yi, x1i, x2i, . . . , xpi : i = 1, . . . , n}; Y is the response variable, the x's are covariates
Model: Yi = β0 + β1x1i + . . . + βpxpi + εi, with εi ∼ N(0, σ²)
Linear relationship between response and covariates
‘Errors’ εi are independent and normally distributed, with zero
mean and constant variance
β0 is intercept (expected response when all covariates are zero)
βj is regression coefficient (expected change in response when jth
covariate increases by 1 unit)
Estimation in regression models
Regression coefficients β0, β1, . . . , βp via least squares:
Choose parameter estimates β̂0, β̂1, . . . , β̂p to minimise the sum of squares
$$ SS = \sum_{i=1}^{n} \left( Y_i - [\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}] \right)^2 $$
Minimised value is residual sum of squares (RSS)
Error variance σ² is estimated by σ̂² = RSS/(n - k), where k = p + 1
n – k is residual degrees of freedom (# of observations minus # of
coefficients estimated)
Regression in R
Command is lm (‘linear model’)
Model structure specified using a formula:
R formula notation
Generic form: Y ~ x1 + x2 + x3
Y is the response variable
x1, x2, x3 are the covariates
So, to regress Y on x1, x2 and x3: lm(Y ~ x1 + x2 + x3)
Example: US Minimum Temperature dataset
Contains average daily minimum temperature in January
at 56 US cities over the time period 1931-1960.
Structure: city name, temperature, latitude (°N of the Equator) and longitude (°W of Greenwich):
Temperature is measured in degrees Fahrenheit.
> head(ustemp)
city min.temp latitude longitude
1 Mobile, AL 44 31.2 88.5
2 Montgomery, AL 38 32.9 86.8
3 Phoenix, AZ 35 33.6 112.5
4 Little Rock, AR 31 35.4 92.8
5 Los Angeles, CA 47 34.3 118.7
6 San Francisco, CA 42 38.4 123.0
US temperatures: visualising the data
[Map: Mean US January daily minimum temperature, 1931-1960, with a colour scale in °F dividing the range into the intervals (0,10], (10,20], ..., (60,70].]
Colour scale divides temperature range into equal intervals (NB
careful choice of colours — see comments in script during workshop).
Shows roughly linear (planar) variation of minimum temperature with
latitude and longitude, of the form
Temperature ≈ β0 + (β1 × Latitude) + (β2 × Longitude)
US temperatures: fitting a linear model
> Temp.Model1 <-
+ lm(min.temp ~ latitude + longitude, data=ustemp)
> print(Temp.Model1)
Call:
lm(formula = min.temp ~ latitude + longitude, data = ustemp)
Coefficients:
(Intercept) latitude longitude
98.645 -2.164 0.134
Note that Temp.Model1 is an object of class 'lm', hence print() produces useful information about the fitted model (object classes were covered in Workshop 1).
Regression models: more theory
Matrix representation of the linear regression model
Model for all observations can equivalently be written in matrix form as
$$ \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} $$
or, more compactly,
$$ \underbrace{Y}_{n \times 1} = \underbrace{X}_{n \times (p+1)} \, \underbrace{\beta}_{(p+1) \times 1} + \underbrace{\varepsilon}_{n \times 1} $$
X is design matrix for the model
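A minimal sketch (assuming the ustemp data and the Temp.Model1 fit from earlier) to inspect the design matrix that R constructs:

X <- model.matrix(Temp.Model1)   # extract the design matrix X
head(X)                          # first column of 1's, then the latitude and longitude columns
dim(X)                           # n rows and p + 1 columns: 56 x 3 here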
Inference about the coefficients
Least-squares estimator of the coefficient vector β can be written as β̂ = (XᵀX)⁻¹XᵀY.
Under the model assumptions, β̂ has a multivariate normal distribution with mean vector β and covariance matrix σ²(XᵀX)⁻¹.
Standard errors of the coefficient estimators are the square roots of the diagonal elements of the covariance matrix
Need to replace σ² with its estimate σ̂² = RSS/(n - k), where k is # of coefficients estimated
To test H0 : βj = 0, use test statistic tj = β̂j / s.e.(β̂j) and compare with a t distribution on n - k degrees of freedom.
NB beware collinearity: if covariates are highly correlated then test results can be sensitive to other covariates in the model.
Can also derive confidence intervals for individual coefficients (command confint()), confidence intervals for the regression line and prediction intervals for future observations (both using command predict()): details in workshop.
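For example (a sketch using Temp.Model1 from earlier; the new city coordinates below are made up for illustration):

confint(Temp.Model1)                                 # 95% CIs for the coefficients
new.city <- data.frame(latitude=40, longitude=100)   # hypothetical location
predict(Temp.Model1, newdata=new.city, interval="confidence")   # CI for the regression line
predict(Temp.Model1, newdata=new.city, interval="prediction")   # PI for a future observation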
Assessing the explanatory power of the model
Explanatory power often measured using coefficient of determination
or proportion of variance explained:
$$ R^2 = 1 - \frac{RSS}{\text{Total Sum of Squares}}, \quad \text{where Total Sum of Squares} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2. $$
With a single covariate, R² is the square of the Pearson correlation coefficient.
R² can be increased indefinitely by including more and more covariates: better to adjust for the number of covariates in the model and use the adjusted R²:
$$ R^2_{ADJ} = 1 - \frac{RSS/(n - k)}{\text{Total Sum of Squares}/(n - 1)}, $$
where k = p + 1 is the number of coefficients estimated, so that n - k is the residual degrees of freedom.
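These quantities can be checked by hand from a fitted model. A short sketch, assuming Temp.Model1 and the ustemp data from earlier:

RSS <- sum(residuals(Temp.Model1)^2)
TSS <- sum((ustemp$min.temp - mean(ustemp$min.temp))^2)
n <- nrow(ustemp)
1 - RSS / TSS                                            # multiple R-squared
1 - (RSS / df.residual(Temp.Model1)) / (TSS / (n - 1))   # adjusted R-squared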
Inference and explanatory power in R
Example: linear model for US temperature
Use summary() command on an object of class lm:
> summary(Temp.Model1)
Call:
lm(formula = min.temp ~ latitude + longitude, data = ustemp)
Residuals:
     Min       1Q   Median       3Q      Max 
-12.9983  -3.8957   0.5577   3.7330  22.0113 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  98.64523    8.32708  11.846   <2e-16 ***
latitude     -2.16355    0.17570 -12.314   <2e-16 ***
longitude     0.13396    0.06314   2.122   0.0386 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.935 on 53 degrees of freedom
Multiple R-squared:  0.7411,    Adjusted R-squared:  0.7314
F-statistic: 75.88 on 2 and 53 DF, p-value: 2.792e-16
Residual standard error is estimate of σ
53 degrees of freedom is n – k (56 observations, 3 coefficients
estimated)
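These can also be extracted directly from the fitted model if needed (a short sketch):

sigma(Temp.Model1)          # residual standard error, 6.935
df.residual(Temp.Model1)    # residual degrees of freedom, 53 = 56 - 3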
Interactions in regression models
With two covariates, suppose Yi = β0 + β1xi1 + β2xi2 + εi.
Suppose also that x2 modulates effect of x1: β1 = γ0 + γ1xi2 (NB
should therefore write β1i). Then
Yi = β0 + (γ0 + γ1xi2)xi1 + β2xi2 + εi
= β0 + γ0xi1 + β2xi2 + γ1xi1xi2 + εi .
Easily handled: just define extra covariate x1x2.
Defining interactions in R
Use ":" in the model formula, e.g. lm(y ~ x1 + x2 + x1:x2)
Interpretation of interaction: slope of (x1, y) relationship depends
on value of x2 (or vice versa)
NB: usually, if a model includes an interaction x1:x2 then
corresponding main effects x1 and x2 should also be included
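For example, an interaction in the US temperature model could be specified as follows (a sketch; x1*x2 is R shorthand for x1 + x2 + x1:x2, i.e. main effects plus interaction):

lm(min.temp ~ latitude + longitude + latitude:longitude, data=ustemp)
lm(min.temp ~ latitude * longitude, data=ustemp)    # equivalent shorthand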
Comparing models
Often want to compare two nested models, M0 & M1 say, where
M0 is a special case of M1 (e.g. obtained by dropping one or more
terms)
Equivalently, where M1 is obtained by adding m extra terms to M0
Can test hypothesis H0: data were generated from M0 using
F-statistic :
$$ F = \frac{(RSS_0 - RSS_1)/m}{RSS_1/(n - k)} $$
where k is # of coefficients estimated in M1
Test statistic has F distribution under H0
Use anova() command in R, e.g. anova(model1, model2, test="F")
NB if just one term is dropped (m = 1), gives same result as t-test
for corresponding coefficient in M1
NB also: models must be fitted to identical response data
Extending the US temperature model
Exploratory analysis indicated that temperature variation isn’t exactly
linear — maybe better described by a quadratic function of latitude
and longitude?
Quadratic function of two variables x1 and x2 has terms in x1, x2, x1², x2² and x1x2
The x1x2 term is just an interaction!
Extending the model: specimen R code (more in workshop)
> Temp.Model2 <- update(Temp.Model1, . ~ . +
+ I(latitude^2) + I(longitude^2) +
+ latitude:longitude)
> anova(Temp.Model1, Temp.Model2, test="F")
Analysis of Variance Table
Model 1: min.temp ~ latitude + longitude
Model 2: min.temp ~ latitude + longitude + I(latitude^2) + I(longitude^2) +
    latitude:longitude
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     53 2548.65                                  
2     50  830.72  3    1717.9 34.467 3.197e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
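For reference, the update() call above is equivalent to refitting the extended model directly (a sketch):

Temp.Model2 <- lm(min.temp ~ latitude + longitude + I(latitude^2) +
                    I(longitude^2) + latitude:longitude, data=ustemp)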
Model checking
Recall model assumptions: errors εi have zero mean, constant
variance, normal distribution and are independent
Estimate errors using residuals ei : i = 1, . . . , n, where
ei = Yi - [β̂0 + β̂1xi1 + . . . + β̂pxip]
When n ≫ k, residuals should behave like errors ⇒ use plots of
residuals to assess validity of assumptions:
1 Plot residuals against fitted values: should show random scatter about
zero with no systematic structure in either mean or variability
2 Scale-location plot: shows absolute values of residuals against fitted
values (easier to see non-constant variance)
3 Normal Q-Q plot to assess normality assumption
4 May add smooth curve to plots to help interpretation (particularly
useful for large datasets — but NB curve is subject to sampling
variability so don’t overinterpret it)
Independence: context-dependent e.g. unlikely to hold for data
collected at successive time points
Consequences of assumption failure
Systematic structure in mean residuals: structural component of
model is inadequate (does it matter? Depends on application and
magnitude of problem)
Non-constant variance: inefficient estimation (doesn’t make best
use of data), incorrect standard errors etc.
Non-normal errors: not critical in large samples except when
computing prediction intervals (see later)
Lack of independence: incorrect standard errors etc.
Another pitfall: influential observations
Outlying observations can potentially have
big impact on least-squares estimates:
Outliers in y-direction (large residuals) ‘pull’
regression line towards them
Outliers in x-direction (‘leverage points’)
encourage regression line to pass close to
them
Influence of each observation measured by
Cook’s distance
Measure of change in least-squares
estimates if observation were omitted
Rule-of-thumb: observation influential if
Cook’s distance exceeds 8/(n – 2k) where
k is # of coefficients estimated
Options for resolving problems: see workshop
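A short sketch applying the rule of thumb above to the quadratic US temperature model (assuming Temp.Model2 from earlier):

d <- cooks.distance(Temp.Model2)                    # Cook's distance for each observation
n <- nrow(ustemp); k <- length(coef(Temp.Model2))   # k = # of coefficients estimated
which(d > 8 / (n - 2 * k))                          # observations flagged as influential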
Illustration
[Figure: scatterplot of y against x showing two least-squares fits, one using all data points and one omitting a high leverage point; below, a plot of Cook's distance against observation number, flagging observations 7, 29 and 41.]
Model checking in R: use plot()
plot() for an object of class 'lm' generates a variety of residual plots
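For example, the four diagnostic plots shown below can be produced as follows (a sketch; which=1:4 selects the Residuals vs Fitted, Normal Q-Q, Scale-Location and Cook's distance plots):

par(mfrow=c(2, 2))              # arrange the four plots in a 2 x 2 grid
plot(Temp.Model2, which=1:4)
par(mfrow=c(1, 1))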
Diagnostics for quadratic US temperature model
[Figure: Residuals vs Fitted, Normal Q-Q, Scale-Location and Cook's distance plots for the quadratic US temperature model; observations 25, 32 and 52 are labelled in the residual plots, and observations 5, 13 and 52 in the Cook's distance plot.]
ANOVA models
ANOVA models
Regression models represent variation of response with continuous
covariates: what about factor covariates i.e. variation of response in
different groups?
Example: petal lengths in the iris data (Workshop 1)
Response variable: petal length
Covariate: species (Setosa, Versicolor or Virginica)
‘Classical’ approach: Analysis of Variance (ANOVA)
Decompose variation into ‘between-groups’ and ‘within-groups’ sum of
squares by computing overall mean and group means
F-test of null hypothesis of equal mean responses in each group
Presented in ANOVA table: sums of squares, mean squares, degrees of
freedom etc.
ANOVA in R
Command for classical ANOVA is aov — use summary() to get
classical ANOVA table
Example: petal lengths for the iris species
> data(iris)
> summary(aov(Petal.Length ~ Species, data=iris))
             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  437.1  218.55    1180 <2e-16 ***
Residuals   147   27.2    0.19                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BUT · · ·
ANOVA models are linear regressions!
Example: petal lengths for the iris species. Let:
Yi be petal length for the ith specimen
x1i = 1 if the ith specimen is 'Setosa', 0 otherwise
x2i = 1 if the ith specimen is 'Versicolor', 0 otherwise
x3i = 1 if the ith specimen is 'Virginica', 0 otherwise
Consider the linear model Yi = β1x1i + β2x2i + β3x3i + εi (NB no
intercept)
For Setosa specimens we have Yi = β1 + εi ⇒ E (Yi) = β1
For Versicolor and Virginica specimens we have E (Yi) = β2 and
E (Yi) = β3 respectively.
Least-squares coefficient estimates are sample means within each
group
For a factor covariate with L levels, R generates L binary 'dummy variables' to represent its effect in regression models
L = 3 for 'Species' in the iris data
Demonstration of equivalence for iris data
> tapply(iris$Petal.Length, INDEX=iris$Species, FUN=mean)
setosa versicolor virginica
1.462 4.260 5.552
> lm(Petal.Length ~ Species - 1, data=iris)

Call:
lm(formula = Petal.Length ~ Species - 1, data = iris)

Coefficients:
    Speciessetosa  Speciesversicolor   Speciesvirginica  
            1.462              4.260              5.552  
NB use of -1 in formula to exclude the intercept
Can use the model.matrix() command to see the design matrix
(i.e. X) that R has used — see workshop.
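For example (a sketch), the design matrices with and without an intercept can be compared directly:

head(model.matrix(~ Species - 1, data=iris))   # three dummy columns, no intercept
head(model.matrix(~ Species, data=iris))       # intercept plus two dummy columns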
ANOVA models with an intercept
Iris data: petal lengths again
> iris.Model1 <- lm(Petal.Length ~ Species, data=iris)
> summary(iris.Model1)
Call:
lm(formula = Petal.Length ~ Species, data = iris)
Residuals:
   Min     1Q Median     3Q    Max 
-1.260 -0.258  0.038  0.240  1.348 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.46200    0.06086   24.02   <2e-16 ***
Speciesversicolor  2.79800    0.08607   32.51   <2e-16 ***
Speciesvirginica   4.09000    0.08607   47.52   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4303 on 147 degrees of freedom
Multiple R-squared:  0.9414,    Adjusted R-squared:  0.9406
F-statistic: 1180 on 2 and 147 DF, p-value: < 2.2e-16
NB there is no coefficient for Setosa now, but the intercept is equal to the 'Setosa' coefficient in the previous model
Other coefs represent differences between other species and Setosa
ANOVA models with an intercept — what’s going on?
With intercept in model for iris data, design matrix X has
First column: all 1’s
Column 2: values of xi1 i.e. 1 for 'Setosa', 0 otherwise
Column 3: values of xi2 i.e. 1 for 'Versicolor', 0 otherwise
Column 4: values of xi3 i.e. 1 for 'Virginica', 0 otherwise
NB the sum of columns 2–4 is always 1: so column 1 is a linear combination of columns 2–4 and the columns are linearly dependent.
Consequence: (XᵀX) isn't invertible ⇒ the least-squares estimate β̂ = (XᵀX)⁻¹XᵀY isn't unique.
Solution to problem: impose constraint to remove linear
dependence
Default in R is to set βj = 0 for one level of a factor — Setosa here
Iris model becomes Yi = β0 + β2xi2 + β3xi3 + εi, so for Setosa we
have E(Yi) = β0, while for Versicolor and Virginica we have
E(Yi) = β0 + β2 and E(Yi) = β0 + β3 respectively.
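A short sketch illustrating this, assuming iris.Model1 from the previous slide: both parameterisations give identical fitted values, and relevel() changes which level is used as the baseline.

iris.NoInt <- lm(Petal.Length ~ Species - 1, data=iris)    # no-intercept fit from earlier
all.equal(fitted(iris.Model1), fitted(iris.NoInt))          # TRUE: same fitted values
iris2 <- transform(iris, Species=relevel(Species, ref="virginica"))
coef(lm(Petal.Length ~ Species, data=iris2))                # virginica now the reference level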
Factor covariates: what you need to know
To include a factor with L levels as a covariate in a linear regression
model, L – 1 coefficients will be estimated (‘L – 1 degrees of
freedom’)
By default, coefficient for ’reference level’ is set to zero so that other
coefficients are differences relative to this reference level (e.g. Setosa
in iris example)
Other constraints also possible (see workshop) — but choice of
constraint does not affect the fitted values (i.e. estimates of E(Yi))
NB Care required when interpreting p-values in summary():
interpretation of coefficients depends on constraint that has been
applied
Factor covariates can be included alongside continuous covariates in
linear regression models
Interactions between factor and continuous covariates imply different
regression slopes within each group (see workshop)
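For instance, a sketch of such a model in the iris data, allowing a separate slope of petal length on petal width for each species:

lm(Petal.Length ~ Petal.Width * Species, data=iris)   # separate intercepts and slopes by species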