STAT0023 Week 6
Generalised linear and generalised additive models
Ioanna Manolopoulou and Richard Chandler
Generalised linear models
Recap: multiple linear regression model
Multiple linear regression model specifies
Yi = β0 + β1xi1 + . . . + βpxip + εi
where ‘errors’ εi have zero mean, constant variance, normal
distribution and are independent
Transformation of variables (e.g. take logs) can sometimes ensure
that assumptions are met — but not always possible / desirable:
May not be possible to find transformation to linearise relationship —
need nonlinear regression model (studied in Week 4)
Models for transformed variables may be hard to interpret in
subject-matter terms
Transformations may cause numerical problems
Example: numerical problems in a transformation
Revisit Galapagos biodiversity data from Workshop 1, & plot number
of endemic species against island area
Take logs of x and y to linearise and stabilise variance
Looks good? Look at the code . . .
> gala.data <- read.table("galapagos.dat")
> plot(gala.data$Area, gala.data$Endemics) # Other arguments omitted
> plot(gala.data$Area, gala.data$Endemics, log="xy")
Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value
<= 0 omitted from logarithmic plot
The linear model: an alternative perspective
Under linear model, (population) mean for Yi is, say
µi = β0 + β1xi1 + . . . + βpxip . . .
. . . and (population) variance (variation about population mean) is
σ2, variance of εi
Write E (Yi) = µi, Var (Yi) = σ2
Then, linear regression model says:
Given covariates, Yi has normal distribution with parameters µi (mean)
and σ2 (variance)
Suggests generalisation to other distributions . . .
Generalised Linear Models
First step in generalisation: specify that response variables are
drawn from common family of distributions (normal, Poisson,
binomial, gamma, . . .)
Pedantic technical detail: distributions must be in exponential family
for computational and inferential reasons
Second step: means of distributions related to linear function of
covariates: µi related to β0 + β1xi1 + . . . + βpxip again
Relationship can be nonlinear: g(µi) = β0 + β1xi1 + . . . + βpxip for
link function g( · )
Can be used to ensure (e.g.) that modelled means are nonnegative by
using log link: log µi = β0 + β1xi1 + . . . + βpxip
Need additional dispersion parameter to specify variances for some
distributions:
E.g. normal distributions: dispersion parameter is σ2 & must be
estimated separately
But for Poisson distributions, variance = mean
Example: Galapagos endemics data
Response variable is count ⇒ natural to
consider model based on Poisson
distribution
For Poisson, variance = mean ⇒ larger
islands should have more scatter in their
data (scatterplot shows this)
Use log link to ensure modelled means are positive: Yi ∼ Poi (µi),
log µi = β0 + β1 log Areai
Note: µi gets transformed, not Yi
More interpretable: µi = eβ0+β1 log Areai ⇒ exponential relationship
between log Area and mean # of endemics (scatterplot shows this
too)
Even more interpretable: rearrange relationship as µi = A × Areai^β1,
where A = e^β0
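The equivalence between the log-link form and the power-law form can be checked numerically. A minimal sketch with hypothetical coefficient values (illustrative only, not the fitted estimates):

```r
# Hypothetical coefficients, chosen purely for illustration
b0 <- 2.4
b1 <- 0.27
Area <- c(1, 10, 100)

mu.link  <- exp(b0 + b1 * log(Area))  # mean via the log link
mu.power <- exp(b0) * Area^b1         # rearranged power-law form
all.equal(mu.link, mu.power)          # the two forms agree exactly
```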
Fitting a GLM in R
Use glm() command instead of lm()
Need family argument to choose distribution, and link to specify
link function
Example:
> galapagos.glm1 <- glm(Endemics ~ log(Area),
+                       family=poisson(link="log"),
+                       data=gala.data)
summary(), plot(), confint() etc. work exactly as for linear
models
Example (continued):
> summary(galapagos.glm1)
Call:
glm(formula = Endemics ~ log(Area), family = poisson(link = "log"),
    data = gala.data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.0329  -1.9822  -0.3976   1.1263   3.8114

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.40127    0.06540   36.72   <2e-16 ***
log(Area)    0.27373    0.01192   22.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 743.55 on 29 degrees of freedom
Residual deviance: 121.97 on 28 degrees of freedom
AIC: 258.93
Number of Fisher Scoring iterations: 4
Model fitting — how it works
Least-squares fitting not optimal for GLMs because variance depends
on mean (can tolerate worse fit for large-variance observations)
Better alternative: maximum likelihood
For linear regression model, least squares is maximum likelihood
(proof in self-study notes) ⇒ GLMs generalise model, maximum
likelihood generalises estimation
Theory allows calculation of standard errors, confidence intervals and
hypothesis tests based on Fisher information (see Week 4, STAT0005
etc.)
For GLMs, algorithm used to find MLE is iterative (re)weighted least
squares or 'Fisher scoring' — highest weight to cases with smallest
modelled variances
Algorithm is slight modification of Newton-Raphson (Week 4)
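The IRLS idea above can be sketched directly for a Poisson GLM with log link. This is an illustrative simulation (the data, variable names and fixed iteration count are assumptions, not from the slides); the hand-rolled loop recovers the same estimates as glm():

```r
set.seed(1)
x <- runif(50)
y <- rpois(50, exp(1 + 2 * x))   # simulated Poisson responses
X <- cbind(1, x)                 # design matrix

beta <- c(0, 0)                  # starting values
for (it in 1:25) {
  eta <- drop(X %*% beta)        # linear predictor
  mu  <- exp(eta)                # log link: mu = exp(eta)
  W   <- mu                      # IRLS weight (dmu/deta)^2 / Var(Y) = mu
  z   <- eta + (y - mu) / mu     # working response
  beta <- solve(t(X) %*% (W * X), t(X) %*% (W * z))  # weighted LS step
}
cbind(irls = drop(beta), glm = coef(glm(y ~ x, family = poisson)))
```

The two columns agree: Fisher scoring is just weighted least squares applied repeatedly, with the weights recomputed from the current fitted means.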
Comparing models
Recap: in multiple linear regression, nested models M0 and M1 can
be compared using F-test based on residual sums of squares
For GLMs, deviance generalises RSS: proportional to difference
between maximised log-likelihoods: ℓ1 − ℓ0 say for the two models
Deviance = RSS in linear regression models
For models without dispersion parameter, deviance statistic has
chi-squared distribution under H0: data generated from simpler model
M0
For models with dispersion parameter, deviance statistic has F
distribution as in normal case
In R, use anova() as for linear models
Can also use information criteria e.g. Akaike (AIC), Bayes (BIC):
All of form −2ℓ + Kp where ℓ is log-likelihood and p is # of
coefficients — balance model fit against complexity, low values better
(more in self-study materials)
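A deviance comparison of nested Poisson models can be sketched with simulated data (the data and model names here are illustrative, not the Galapagos fits):

```r
set.seed(2)
x1 <- runif(40); x2 <- runif(40)
y  <- rpois(40, exp(0.5 + 1.5 * x1))      # x2 has no real effect

m0 <- glm(y ~ x1,      family = poisson)  # simpler model M0
m1 <- glm(y ~ x1 + x2, family = poisson)  # larger model M1

# Poisson has no dispersion parameter, so use the chi-squared test
anova(m0, m1, test = "Chisq")

# The deviance difference is twice the gain in maximised log-likelihood
deviance(m0) - deviance(m1)
AIC(m0, m1)   # lower AIC is better
```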
Model checking
Model checking follows same ideas as for linear models
Difference: residuals not uniquely defined for GLMs
Some possibilities:
Deviance residuals: defined by analogy with RSS so that deviance is
sum of squared residuals
Pearson residuals: defined to be proportional to (yi – µi)/σi where
µi & σi are mean & std dev under fitted model — should all have
mean zero and constant variance
plot() command in R uses deviance residuals:
Can be unhelpful if data are highly discrete, particularly the Q-Q plot
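The two residual definitions can be extracted and checked in R; a sketch with simulated Poisson data (names and data are illustrative):

```r
set.seed(3)
x <- runif(50)
y <- rpois(50, exp(1 + x))
fit <- glm(y ~ x, family = poisson)

r.dev  <- residuals(fit, type = "deviance")
r.pear <- residuals(fit, type = "pearson")

sum(r.dev^2)   # equals the residual deviance, by construction
deviance(fit)

# Pearson residuals for a Poisson model: (y - mu) / sqrt(mu)
max(abs(r.pear - (y - fitted(fit)) / sqrt(fitted(fit))))
```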
Example: Galapagos endemics model again
Residual plots
Some influential observations
(Cook’s distance > 8/(n – 2k)
= 8/26 ‘ 0.3)
‘Residuals vs fitted’ plot shows
several residuals > 2 in magnitude
— suggests overdispersion i.e.
variance larger than expected for
Poisson model
Possible solution to overdispersion: quasi-Poisson modelling (see
self-study materials)
Prediction with GLMs
Confidence intervals can be computed for fitted values as for linear
model
predict() in R produces confidence intervals for the linear predictor
β0 + β1xi1 + . . . + βpxip = g(µi) — back-transform to get intervals
for µi itself
Prediction intervals for new observations are more difficult (technical
reasons) — no simple solution, need advanced computational
techniques
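The back-transformation step can be sketched as follows, using simulated data rather than the Galapagos fit (data, names and the prediction point x = 0.5 are illustrative):

```r
set.seed(5)
x <- runif(30)
y <- rpois(30, exp(1 + x))
fit <- glm(y ~ x, family = poisson)

# Interval on the linear-predictor (link) scale, then back-transform
p <- predict(fit, newdata = data.frame(x = 0.5), se.fit = TRUE)
ci.link <- p$fit + c(-1.96, 1.96) * p$se.fit
exp(ci.link)   # approximate 95% CI for the mean mu at x = 0.5
```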
Generalised additive models
What can’t you do with a GLM?
An artificial example
Possible model for relationship:
Yi = µ(xi) + εi
Regression function µ( · ) is
nonlinear and follows no obvious
mathematical form — can’t use
linear or nonlinear regression
Example is candidate for nonparametric regression methods
‘Nonparametric’ ⇒ no specific form assumed for the regression
function µ( · )
But reasonable to assume regression function is ’smooth’
Basis representation of smooth functions
Remember: any 'sufficiently
well-behaved' function can be
represented using polynomials, so
µ(x) = Σ_{ℓ=0}^{∞} β_ℓ x^ℓ say.
Suggests regressing Y on
x, x², x³, . . . , x^M for some large M i.e.
use polynomial basis representation of
µ( · ).
Disadvantage: polynomials typically collinear (i.e. highly correlated)
and need many polynomials to capture ‘local’ features accurately
Idea suggests use of alternative bases that are more flexible
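Both points can be illustrated in a few lines of R: raw polynomial terms are highly correlated, while a B-spline basis (from the splines package shipped with R) has local support. The grid and df value are illustrative choices:

```r
library(splines)

x <- seq(0, 1, length.out = 100)
cor(x, x^2)          # raw polynomial terms: correlation near 1 (collinear)

B <- bs(x, df = 6)   # cubic B-spline basis with 6 functions (local support)
dim(B)               # 100 evaluation points x 6 basis columns
</imports>
```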
Basis representation using splines
Splines are flexible sets of basis functions with ’good’ properties
Several different types available:
regression splines, B-splines, P-splines
etc.
Given spline basis φ1(x), φ2(x),
φ3(x) . . ., regression function µ( · )
can often be approximated well as
µ(x) ≈ Σ_ℓ β_ℓ φ_ℓ(x)
Suggests regressing on splines
φ1(x), φ2(x), φ3(x) . . . instead of
polynomials
Splines can track local features
better — each one associated with
‘knots’ determining location
In practice, usually use larger basis than shown in this example (fine
grid of knots)
Nonparametric regression for smooth functions
Recap:
Model is Yi = µ(xi) + εi with µ(x) ≈ Σ_ℓ β_ℓ φ_ℓ(x) for spline basis
φ1(x), φ2(x), φ3(x), . . .
To fit model, need to estimate coefficients β`
Least-squares is 'obvious' approach, but large number of coefficients
risks overfitting (e.g. may have more coefficients than observations)
Solution: use penalised least squares i.e. minimise

SSp(β1, β2, . . .) = Σ_{i=1}^{n} [Yi − µ(xi)]² + λ ∫ [µ″(x)]² dx

Second term is 'penalty': integrated square of 2nd derivative of µ(x)
as measure of 'wiggliness' i.e. overfitting
λ ≥ 0 is smoothing parameter controlling strength of penalty: set to
zero for no penalty, increase for greater smoothness
Penalised estimation for example regression problem
Plots show data and estimates of µ(x) in model Yi = µ(xi) + εi with
εi ∼ N(0, σ2) independently for i = 1, . . . , n
µ(x) represented using cubic
regression spline with 20 knots
Coefficients estimated by
minimising Σ_{i=1}^{n} [Yi − µ(xi)]² + λ ∫ [µ″(x)]² dx, for
λ = 0, 100, 10 000
λ = 0 corresponds to no penalty:
undersmoothing (≡ overfitting)
λ → ∞ forces µ(x) to be linear:
possibly oversmoothed
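The effect of fixing λ can be reproduced with mgcv by setting the sp argument directly (simulated data; the seed, basis size k = 20 and sp values are illustrative choices, not the slide's):

```r
library(mgcv)   # 'recommended' package, shipped with R

set.seed(9)
x <- runif(50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)

f0 <- gam(y ~ s(x, k = 20), sp = 0)     # lambda = 0: no penalty, wiggly fit
fL <- gam(y ~ s(x, k = 20), sp = 1e6)   # huge lambda: near-linear fit

sum(f0$edf)   # effective degrees of freedom near the full basis size
sum(fL$edf)   # shrinks towards 2 (intercept + linear trend)
```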
Smoothing parameter selection
Need to choose smoothing parameter λ to control balance of
smoothness vs fit to data
Too large: estimate is too smooth ⇒ biased estimate of regression
function
Too small: estimate is too wiggly ⇒ highly variable estimate of
regression function
Trade-off between bias and variance
Standard approach: choose smoothing parameter to minimise
generalised cross-validation (GCV) criterion, which approximates the
mean squared error when predicting each observation using model
fitted to the remaining n − 1 observations
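GCV-based selection is what mgcv does when asked; a sketch with simulated data (the seed, sample size and method string are illustrative assumptions):

```r
library(mgcv)

set.seed(6)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

fit <- gam(y ~ s(x), method = "GCV.Cp")  # smoothing parameter chosen by GCV
fit$sp        # the selected smoothing parameter
fit$gcv.ubre  # the minimised GCV score
```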
Generalised additive models (GAMs)
Spline-based nonparametric regression can be extended to multiple
covariates and non-normal distributions.
General framework:
Yi has distribution in exponential family (as with GLM)
E(Yi) = µi and g(µi) = η0 + η1(xi1) + η2(xi2) + . . . + ηp(xip) for link
function g( · ) and smooth functions η1( · ), η2( · ), . . . , ηp( · ).
Spline basis representations of smooth functions ηj( · )
Spline coefficients estimated using penalised least squares or penalised
maximum likelihood: smoothing parameters selected using
generalised cross-validation or similar methods
NB 'additive' structure: g(µi) = η0 + Σ_{j=1}^{p} ηj(xij) — hence
'generalised additive model'.
Confidence bands etc. available for estimated smooth functions, but
are approximate
Model comparisons (e.g. F tests) also approximate — approximations
can sometimes be poor due to automatic smoothing parameter
selection.
GAMs in R
All implemented via gam() command in mgcv library.
Example regression problem, revisited
> gam.example <- gam(y ~ s(x)) # s is "smooth"
> summary(gam.example)
Family: gaussian
Link function: identity
Formula:
y ~ s(x)
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.02906    0.02070   1.404    0.189

Approximate significance of smooth terms:
       edf Ref.df     F  p-value
s(x) 8.308  8.858 17.33 2.68e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.888   Deviance explained = 93.7%
GCV = 0.016029  Scale est. = 0.0085686  n = 20
Plot of estimated regression function
plot(gam.example)
[Figure: estimated smooth s(x, 8.31) plotted against x, centred on zero]
Note that estimated function is centred on zero — need to add
intercept to get fitted values from model (more detail in Week 6
workshop)
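The centring can be verified directly: the smooth term extracted with predict(type = "terms") plus the model intercept reproduces the fitted values. A sketch with simulated data (data and names are illustrative, not the workshop example):

```r
library(mgcv)

set.seed(8)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
fit <- gam(y ~ s(x))

# The plotted smooth is centred on zero; add the intercept to recover fits
sm <- predict(fit, type = "terms")[, "s(x)"]
all.equal(as.numeric(fitted(fit)), as.numeric(coef(fit)[1] + sm))  # TRUE
```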
Summary
GLMs are natural (?) extension of linear models
Response variable no longer restricted to be normal
Least squares fitting generalised to maximum likelihood
Operationally, everything (almost) identical to linear models
glm() instead of lm()
Same diagnostic procedures
Same procedures for model comparison, with deviance in place of RSS
Same tricks available for categorical covariates, seasonality,
interactions . . .
GAMs extend things even further — flexible regression methods
In practice, harder to fit and compare very complicated models
A useful book
Julian Faraway, Extending the Linear Model with R (2nd edition, 2016)