STAT0023 Week 2

Linear regression and ANOVA in R

Ioanna Manolopoulou and Richard Chandler

Regression

Regression models

Recap: regression basics (STAT0006 or equivalent)

Data: {Yi, x1i, x2i, . . . , xpi : i = 1, . . . , n}; Y is the response variable, the x's are covariates

Model: Yi = β0 + β1x1i + . . . + βpxpi + εi, with εi ∼ N(0, σ²)


Linear relationship between response and covariates

‘Errors’ εi are independent and normally distributed, with zero

mean and constant variance

β0 is intercept (expected response when all covariates are zero)

βj is regression coefficient (expected change in response when jth

covariate increases by 1 unit)

Estimation in regression models

Regression coefficients β0, β1, . . . , βp via least squares:

Choose parameter estimates β̂0, β̂1, . . . , β̂p to minimise the sum of squares

SS = Σᵢ₌₁ⁿ (Yi − [β0 + β1x1i + . . . + βpxpi])²

Minimised value is residual sum of squares (RSS)

Error variance σ² is estimated as σ̂² = RSS/(n − k), where k = p + 1

n – k is residual degrees of freedom (# of observations minus # of

coefficients estimated)
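As a minimal sketch of these definitions in R (using simulated data, so every name below is illustrative), RSS and σ̂² can be computed directly from a fitted model and checked against summary():

set.seed(1)                              # reproducible simulated example
n <- 50
x1 <- runif(n); x2 <- runif(n)
y <- 2 + 3*x1 - x2 + rnorm(n, sd=0.5)    # true model, sigma = 0.5
fit <- lm(y ~ x1 + x2)
RSS <- sum(resid(fit)^2)                 # minimised sum of squares
sigma2.hat <- RSS / df.residual(fit)     # RSS / (n - k), here k = 3
sqrt(sigma2.hat)                         # should match summary(fit)$sigma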

Regression in R

Command is lm (‘linear model’)

Model structure specified using a formula:

R formula notation

Generic form: Y ~ x1 + x2 + x3

Y is the response variable

x1, x2, x3 are the covariates

So, to regress Y on x1, x2 and x3: lm(Y ~ x1 + x2 + x3)

Example: US Minimum Temperature dataset

Contains average daily minimum temperature in January

at 56 US cities over the time period 1931-1960.

Structure: city name, temperature, latitude (°N of the Equator) and longitude (°W of Greenwich):

Temperature is measured in degrees Fahrenheit.


> head(ustemp)
               city min.temp latitude longitude
1        Mobile, AL       44     31.2      88.5
2    Montgomery, AL       38     32.9      86.8
3       Phoenix, AZ       35     33.6     112.5
4   Little Rock, AR       31     35.4      92.8
5   Los Angeles, CA       47     34.3     118.7
6 San Francisco, CA       42     38.4     123.0

US temperatures: visualising the data

[Figure: map of mean US January daily minimum temperature, 1931-1960, coloured in 10°F bands from (0,10] up to (60,70].]

Colour scale divides temperature range into equal intervals (NB

careful choice of colours — see comments in script during workshop).

Shows roughly linear (planar) variation of minimum temperature with

latitude and longitude, of the form

Temperature ≈ β0 + (β1 × Latitude) + (β2 × Longitude)

US temperatures: fitting a linear model

> Temp.Model1 <-

+ lm(min.temp ~ latitude + longitude, data=ustemp)

> print(Temp.Model1)

Call:

lm(formula = min.temp ~ latitude + longitude, data = ustemp)

Coefficients:

(Intercept) latitude longitude

98.645 -2.164 0.134

Note that Temp.Model1 is an object of class ‘lm’, hence print()

produces useful information about the fitted model (object classes

were covered in Workshop 1).

Regression models: more theory

Matrix representation of the linear regression model

Model for all observations can equivalently be written in matrix form as

Y = Xβ + ε

where Y = (Y1, Y2, . . . , Yn)ᵀ is the n×1 vector of responses; X is the n×(p+1) matrix whose ith row is (1, xi1, xi2, . . . , xip); β = (β0, β1, . . . , βp)ᵀ is the (p+1)×1 vector of coefficients; and ε = (ε1, ε2, . . . , εn)ᵀ is the n×1 vector of errors.

X is the design matrix for the model

Inference about the coefficients

Least-squares estimator of coefficient vector β can be written as

β̂ = (XᵀX)⁻¹XᵀY.

Under model assumptions, β̂ has a multivariate normal distribution with mean vector β and covariance matrix σ²(XᵀX)⁻¹.

Standard errors of coefficient estimators are square roots of diagonal

elements of covariance matrix

Need to replace σ² with its estimate σ̂² = RSS/(n − k), where k is # of coefficients estimated

To test H0 : βj = 0, use test statistic tj = β̂j / s.e.(β̂j) and compare with t distribution on n − k degrees of freedom.

NB beware collinearity: if covariates are highly correlated then test results can be sensitive to which other covariates are in the model.

Can also use these results to derive confidence intervals for individual coefficients (command confint()), confidence intervals for the regression line, and prediction intervals for future observations (both using command predict()) — details in workshop.
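A sketch pulling these results together, assuming the ustemp data and Temp.Model1 from earlier are in the workspace (the new-city values are made up for illustration):

X <- model.matrix(Temp.Model1)                     # design matrix (1, latitude, longitude)
Y <- ustemp$min.temp
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% Y       # (X'X)^{-1} X'Y
sigma2.hat <- sum(resid(Temp.Model1)^2) / df.residual(Temp.Model1)
se <- sqrt(diag(sigma2.hat * solve(t(X) %*% X)))   # standard errors
beta.hat / se                                      # t statistics, as in summary()
confint(Temp.Model1)                               # 95% CIs for coefficients
new.city <- data.frame(latitude=40, longitude=100)
predict(Temp.Model1, newdata=new.city, interval="confidence")  # CI for regression surface
predict(Temp.Model1, newdata=new.city, interval="prediction")  # PI for a new observation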

Assessing the explanatory power of the model

Explanatory power often measured using coefficient of determination

or proportion of variance explained:

R² = 1 − RSS / (Total Sum of Squares)

where Total Sum of Squares = Σᵢ₌₁ⁿ (Yi − Ȳ)².

With a single covariate, R2 is the square of the Pearson correlation

coefficient.

R2 can be increased indefinitely by including more and more

covariates: better to adjust for number of covariates in the model and

use adjusted R2:

R²ADJ = 1 − [RSS / (n − k)] / [Total Sum of Squares / (n − 1)]

where k = p + 1 as before.
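As a small check in R (assuming Temp.Model1 and ustemp from earlier), both quantities can be computed by hand and compared with the summary() output:

RSS <- sum(resid(Temp.Model1)^2)
TSS <- sum((ustemp$min.temp - mean(ustemp$min.temp))^2)
n <- nobs(Temp.Model1); k <- length(coef(Temp.Model1))
1 - RSS/TSS                          # multiple R-squared
1 - (RSS/(n - k)) / (TSS/(n - 1))    # adjusted R-squared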

Inference and explanatory power in R

Example: linear model for US temperature. Use the summary() command on an object of class lm.

> summary(Temp.Model1)

Call:

lm(formula = min.temp ~ latitude + longitude, data = ustemp)

Residuals:
     Min       1Q   Median       3Q      Max
-12.9983  -3.8957   0.5577   3.7330  22.0113

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  98.64523    8.32708  11.846   <2e-16 ***
latitude     -2.16355    0.17570 -12.314   <2e-16 ***
longitude     0.13396    0.06314   2.122   0.0386 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.935 on 53 degrees of freedom
Multiple R-squared:  0.7411,    Adjusted R-squared:  0.7314
F-statistic: 75.88 on 2 and 53 DF,  p-value: 2.792e-16

Residual standard error is estimate of σ

53 degrees of freedom is n – k (56 observations, 3 coefficients

estimated)

Interactions in regression models

With two covariates, suppose Yi = β0 + β1xi1 + β2xi2 + εi.

Suppose also that x2 modulates effect of x1: β1 = γ0 + γ1xi2 (NB

should therefore write β1i). Then

Yi = β0 + (γ0 + γ1xi2)xi1 + β2xi2 + εi

= β0 + γ0xi1 + β2xi2 + γ1xi1xi2 + εi .

Easily handled: just define extra covariate x1x2.

Defining interactions in R

Use ":" in model formula e.g. lm(y ~ x1 + x2 + x1:x2)

Interpretation of interaction: slope of (x1, y) relationship depends

on value of x2 (or vice versa)

NB: usually, if a model includes an interaction x1:x2 then

corresponding main effects x1 and x2 should also be included
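For example, an interaction model for the US temperatures could be specified as below (a sketch, assuming ustemp is loaded); in R's formula language the * operator is shorthand for main effects plus their interaction:

Temp.Int <- lm(min.temp ~ latitude + longitude + latitude:longitude, data=ustemp)
# Equivalent shorthand:
Temp.Int <- lm(min.temp ~ latitude * longitude, data=ustemp)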

Comparing models

Often want to compare two nested models, M0 & M1 say, where

M0 is a special case of M1 (e.g. obtained by dropping one or more

terms)

Equivalently, where M1 is obtained by adding m extra terms to M0

Can test hypothesis H0: data were generated from M0 using

F-statistic:

F = [(RSS0 − RSS1) / m] / [RSS1 / (n − k)]

where k is # of coefficients estimated in M1

Test statistic has an F distribution on m and n − k degrees of freedom under H0

Use anova() command in R e.g. anova(model1, model2, test="F")

NB if just one term is dropped (m = 1), gives same result as t-test

for corresponding coefficient in M1

NB also: models must be fitted to identical response data
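A sketch of the computation that anova() performs, for two generic nested fitted models model0 (smaller) and model1 (larger); both names are placeholders:

rss0 <- sum(resid(model0)^2)
rss1 <- sum(resid(model1)^2)
m <- df.residual(model0) - df.residual(model1)        # extra terms in model1
Fstat <- ((rss0 - rss1)/m) / (rss1/df.residual(model1))
pf(Fstat, m, df.residual(model1), lower.tail=FALSE)   # p-value, matching anova()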

Extending the US temperature model

Exploratory analysis indicated that temperature variation isn’t exactly

linear — maybe better described by a quadratic function of latitude

and longitude?

Quadratic function of two variables x1 and x2 has terms in x1, x2, x1², x2² and x1x2

x1x2 term is just an interaction!

Extending the model: specimen R code (more in workshop)

> Temp.Model2 <- update(Temp.Model1, . ~ . +

+ I(latitude^2) + I(longitude^2) +

+ latitude:longitude)

> anova(Temp.Model1, Temp.Model2, test="F")

Analysis of Variance Table

Model 1: min.temp ~ latitude + longitude
Model 2: min.temp ~ latitude + longitude + I(latitude^2) + I(longitude^2) +
    latitude:longitude

  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1     53 2548.65
2     50  830.72  3    1717.9 34.467 3.197e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model checking

Recall model assumptions: errors εi have zero mean, constant

variance, normal distribution and are independent

Estimate errors using residuals ei : i = 1, . . . , n, where

ei = Yi − [β̂0 + β̂1xi1 + . . . + β̂pxip]

When n ≫ k, residuals should behave like errors ⇒ use plots of residuals to assess validity of assumptions:

1 Plot residuals against fitted values: should show random scatter about

zero with no systematic structure in either mean or variability

2 Scale-location plot: shows absolute values of residuals against fitted

values (easier to see non-constant variance)

3 Normal Q-Q plot to assess normality assumption

4 May add smooth curve to plots to help interpretation (particularly

useful for large datasets — but NB curve is subject to sampling

variability so don’t overinterpret it)

Independence: context-dependent e.g. unlikely to hold for data

collected at successive time points
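The raw ingredients for the plots listed above can also be extracted and drawn by hand (a sketch, assuming a fitted lm object called fit; the built-in plot() method shown later automates this):

r <- rstandard(fit)                      # standardized residuals
f <- fitted(fit)                         # fitted values
plot(f, r); abline(h=0, lty=2)           # 1: residuals vs fitted values
plot(f, sqrt(abs(r)))                    # 2: scale-location plot
qqnorm(r); qqline(r)                     # 3: normal Q-Q plot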

Consequences of assumption failure

Systematic structure in mean residuals: structural component of

model is inadequate (does it matter? Depends on application and

magnitude of problem)

Non-constant variance: inefficient estimation (doesn’t make best

use of data), incorrect standard errors etc.

Non-normal errors: not critical in large samples except when

computing prediction intervals (see later)

Lack of independence: incorrect standard errors etc.

Another pitfall: influential observations

Outlying observations can potentially have

big impact on least-squares estimates:

Outliers in y-direction (large residuals) ‘pull’

regression line towards them

Outliers in x-direction (‘leverage points’)

encourage regression line to pass close to

them

Influence of each observation measured by

Cook’s distance

Measure of change in least-squares

estimates if observation were omitted

Rule-of-thumb: observation influential if

Cook’s distance exceeds 8/(n – 2k) where

k is # of coefficients estimated

Options for resolving problems: see workshop
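A sketch of the Cook's distance check in R, assuming a fitted model such as Temp.Model1; cooks.distance() is the standard extractor:

d <- cooks.distance(Temp.Model1)         # one value per observation
n <- nobs(Temp.Model1)
k <- length(coef(Temp.Model1))           # coefficients estimated
which(d > 8/(n - 2*k))                   # rule-of-thumb influential observations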

Illustration

[Figure: scatterplot of y against x showing two least-squares fits (using all data points, and omitting a high-leverage point), with a plot of Cook's distance by observation number below; observations 41, 7 and 29 stand out.]

Model checking in R: use plot()

plot() applied to an object of class ‘lm’ generates a variety of residual plots
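For example (a sketch, assuming Temp.Model2 from earlier):

par(mfrow=c(2, 2))             # 2 x 2 grid of panels
plot(Temp.Model2, which=1:4)   # residuals vs fitted, Q-Q, scale-location, Cook's distance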

Diagnostics for quadratic US temperature model

[Figure: four diagnostic panels for the quadratic model: Residuals vs Fitted, Normal Q-Q and Scale-Location (each highlighting observations 25, 32 and 52), plus Cook's distance (highlighting observations 52, 5 and 13).]

ANOVA models

ANOVA models

Regression models represent variation of response with continuous

covariates: what about factor covariates i.e. variation of response in

different groups?

Example: petal lengths in the iris data (Workshop 1). Response variable: petal length

Covariate: species (Setosa, Versicolor or Virginica)

‘Classical’ approach: Analysis of Variance (ANOVA)

Decompose variation into ‘between-groups’ and ‘within-groups’ sum of

squares by computing overall mean and group means

F-test of null hypothesis of equal mean responses in each group

Presented in ANOVA table: sums of squares, mean squares, degrees of

freedom etc.

ANOVA in R

Command for classical ANOVA is aov — use summary() to get

classical ANOVA table

Example: petal lengths for the iris species

> data(iris)

> summary(aov(Petal.Length ~ Species, data=iris))

             Df Sum Sq Mean Sq F value Pr(>F)
Species       2  437.1  218.55    1180 <2e-16 ***
Residuals   147   27.2    0.19
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

BUT · · ·

ANOVA models are linear regressions!

Example: petal lengths for the iris species. Let:

Yi be petal length for the ith specimen

x1i = 1 if the ith specimen is ‘Setosa’, 0 otherwise

x2i = 1 if the ith specimen is ‘Versicolor’, 0 otherwise

x3i = 1 if the ith specimen is ‘Virginica’, 0 otherwise

Consider the linear model Yi = β1x1i + β2x2i + β3x3i + εi (NB no

intercept)

For Setosa specimens we have Yi = β1 + εi ⇒ E(Yi) = β1

For Versicolor and Virginica specimens we have E(Yi) = β2 and E(Yi) = β3 respectively.

Least-squares coefficient estimates are sample means within each

group

For a factor covariate with L levels, R generates L binary ‘dummy variables’ to represent its effect in regression models

L = 3 for ‘Species’ in iris data

Demonstration of equivalence for iris data

> tapply(iris$Petal.Length, INDEX=iris$Species, FUN=mean)

    setosa versicolor  virginica
     1.462      4.260      5.552

> lm(Petal.Length ~ Species - 1, data=iris)

Call:
lm(formula = Petal.Length ~ Species - 1, data = iris)

Coefficients:
    Speciessetosa  Speciesversicolor   Speciesvirginica
            1.462              4.260              5.552

NB use of -1 in formula to exclude the intercept

Can use the model.matrix() command to see the design matrix

(i.e. X) that R has used — see workshop.
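For instance (a minimal sketch; the workshop gives more detail):

head(model.matrix(Petal.Length ~ Species - 1, data=iris))  # three dummy columns, no intercept
head(model.matrix(Petal.Length ~ Species, data=iris))      # intercept + two dummy columns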

ANOVA models with an intercept

Iris data: petal lengths again

> iris.Model1 <- lm(Petal.Length ~ Species, data=iris)

> summary(iris.Model1)

Call:

lm(formula = Petal.Length ~ Species, data = iris)

Residuals:
   Min     1Q Median     3Q    Max
-1.260 -0.258  0.038  0.240  1.348

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.46200    0.06086   24.02   <2e-16 ***
Speciesversicolor  2.79800    0.08607   32.51   <2e-16 ***
Speciesvirginica   4.09000    0.08607   47.52   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4303 on 147 degrees of freedom
Multiple R-squared:  0.9414,    Adjusted R-squared:  0.9406
F-statistic:  1180 on 2 and 147 DF,  p-value: < 2.2e-16

NB also: no coefficient for Setosa now, but the intercept is equal to the ‘Setosa’ coefficient in the previous model

Other coefs represent differences between other species and Setosa

ANOVA models with an intercept — what’s going on?

With intercept in model for iris data, design matrix X has

First column: all 1’s

Column 2: values of xi1 i.e. 1 for ‘Setosa‘, 0 otherwise

Column 3: values of xi2 i.e. 1 for ‘Versicolor‘, 0 otherwise

Column 4: values of xi3 i.e. 1 for ‘Virginica‘, 0 otherwise

NB sum of columns 2–4 is always 1: so column 1 is linear

combination of columns 2–4 & columns are linearly dependent.

Consequence: (XᵀX) isn't invertible ⇒ least-squares estimate β̂ = (XᵀX)⁻¹XᵀY isn't unique.

Solution to problem: impose constraint to remove linear

dependence

Default in R is to set βj = 0 for one level of a factor — Setosa here

Iris model becomes Yi = β0 + β2xi2 + β3xi3 + εi, so for Setosa we have E(Yi) = β0; for Versicolor & Virginica we have E(Yi) = β0 + β2 & E(Yi) = β0 + β3 respectively.

Factor covariates: what you need to know

To include a factor with L levels as a covariate in a linear regression

model, L – 1 coefficients will be estimated (‘L – 1 degrees of

freedom’)

By default, coefficient for the ‘reference level’ is set to zero so that other

coefficients are differences relative to this reference level (e.g. Setosa

in iris example)

Other constraints also possible (see workshop) — but choice of

constraint does not affect the fitted values (i.e. estimates of E(Yi))

NB Care required when interpreting p-values in ‘summary‘:

interpretation of coefficients depends on constraint that has been

applied

Factor covariates can be included alongside continuous covariates in

linear regression models

Interactions between factor and continuous covariates imply different

regression slopes within each group (see workshop)
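As a sketch of that last point using the iris data (the choice of Petal.Width as the continuous covariate is purely illustrative):

# Common slope, separate intercepts for each species:
lm(Petal.Length ~ Petal.Width + Species, data=iris)
# Interaction: a separate slope within each species:
lm(Petal.Length ~ Petal.Width * Species, data=iris)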
