STAT6110/STAT3110
Statistical Inference
Topic 5 – Estimation methods
Jun Ma
Topic 5
Semester 1, 2021
Topic 5 Outline: Estimation methods
Maximum likelihood estimation
Computation of the MLE
Information and standard errors
Properties of the MLE
Parameter transformations
Multiple parameters
Further estimation methods
Maximum likelihood estimation
Consider a sample y = (y_1, ..., y_n) and likelihood function
L(θ) = L(θ; y)
The most supported parameter value θ̂ will be the value for which
L(θ̂) > L(θ′) whenever θ′ ≠ θ̂
The corresponding estimate, θ̂ = Θ̂(y), is called the maximum likelihood estimate
Provides the parameter value that makes the observed sample the
most likely sample among all possible samples
Like any estimate, it is the observed value of a random variable, the maximum likelihood estimator Θ̂ = Θ̂(Y)
The abbreviation MLE is used interchangeably for both the estimate and the estimator
Maximum likelihood estimation
Previously we relied on being able to choose estimates that made intuitive sense, e.g. the sample mean to estimate the population mean
Often it is not possible to find an intuitively obvious estimate
MLE is a general method of estimation
MLE can be applied in any situation where we can write down a
likelihood function
MLE provides a general criterion to base our estimate on – the most supported parameter value
MLE: definition
An MLE is any value θ̂ that is in the parameter space and for which L(θ̂) ≥ L(θ′) for any other parameter value θ′ that is also in the parameter space
An MLE maximises the function L(θ) over the parameter space
If a particular parameter value maximises the likelihood function then
it will also maximise the log-likelihood function
An MLE maximises the function ℓ(θ) = log L(θ) over the parameter space
This is often used for computation because ℓ(θ) is usually a more mathematically convenient function
MLE: complications
Uniqueness:
- Usually only one unique maximum for the likelihood function
- In some complicated contexts there may be two or more parameter values that achieve the maximum likelihood value
- Then we say that the MLE is not unique
- This is a problem when it arises, because we may not know which maximum is the appropriate estimate
- In most contexts of interest the MLE is unique and this potential complication does not arise
Parameter space:
- When finding the MLE we must consider only those values of the parameter that are within the parameter space, even if the function L(θ) can be evaluated outside the parameter space
MLE: example
Recall the fertility example (geometric distribution) and for now we
assume just one group
The sufficient statistic is y = Σ_i y_i (total attempts by all couples)
The likelihood function for p (success probability in one cycle) is
L(p) = p^n (1 − p)^(y−n)
Since p is a probability, the parameter space is the interval [0, 1]
The MLE is defined as the maximum of L(p) over [0, 1]
Note that it is possible to evaluate the function L(p) for values of p outside this interval
Values of p outside [0, 1] must be ignored in determining the MLE
MLE: example
Consider a sample in which n = 2 couples take a total of y = 4
attempts to achieve their first pregnancy
The corresponding likelihood function is
L(p) = p² (1 − p)²
This is plotted on the next slide
The maximum of the likelihood function over the parameter space occurs at the MLE p̂ = 0.5, corresponding to an estimated 50% probability of pregnancy in any given cycle
Notice that L(p) may achieve higher values outside the parameter space; however, these are of no relevance for determining the MLE
[Figure: the likelihood L(p) = p² (1 − p)² plotted against p, with the parameter space [0, 1] indicated and the MLE p̂ = 0.5 marked]
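The maximum over the parameter space can also be located numerically by evaluating L(p) on a grid restricted to [0, 1]; a minimal Python sketch (not part of the original slides), for the (n, y) = (2, 4) example above:

import numpy as np

n, y = 2, 4                            # 2 couples, 4 total attempts
p = np.linspace(0.001, 0.999, 9999)    # grid restricted to the parameter space
L = p**n * (1 - p)**(y - n)            # likelihood L(p) = p^n (1 - p)^(y - n)
print(p[np.argmax(L)])                 # approximately 0.5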
Computation of the MLE
We will start by considering a one-dimensional parameter and
generalise later
Calculus is often a useful tool for computing the MLE
In many contexts, the maximum value of the likelihood function, or equivalently the log-likelihood function, occurs at a point where its derivative is zero – a stationary point
Then the MLE solves an equation that involves setting the derivative
of the log-likelihood function (with respect to the parameter θ) equal
to zero
This leads to the so-called likelihood equation, or score equation,
dℓ(θ)/dθ = 0
Computation of the MLE
The maximum of the likelihood function may not occur at a
stationary point
This is less common but is worth being aware of, and may make it
necessary to visually inspect the log-likelihood function by plotting it
out
In practice there exist mathematical criteria that can be used to
establish whether the MLE is the solution of the likelihood equation
or not
If the log-likelihood function is concave, meaning that its second derivative is less than zero for all values of θ, then any solution of the likelihood equation is the unique MLE
Example: computation of MLE
The log-likelihood function for the geometric fertility example is
ℓ(p) = n log(p) + (y − n) log(1 − p)
which has derivative
dℓ(p)/dp = n/p − (y − n)/(1 − p)
Setting this equal to zero and solving leads to the MLE of p:
p̂ = n/y
If there are n = 2 couples and y = 4 pregnancy attempts, then the MLE is p̂ = 0.5, as before
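A quick symbolic check of this derivation (a sketch, not part of the slides), using Python's sympy to solve the score equation:

from sympy import symbols, log, diff, solve

n, y, p = symbols('n y p', positive=True)
ell = n*log(p) + (y - n)*log(1 - p)   # geometric log-likelihood
print(solve(diff(ell, p), p))         # [n/y], the explicit MLE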
Example: computation of MLE
The previous approach is only appropriate when the maximum of the
log-likelihood is a stationary point
The maximum of the log-likelihood may not occur at a stationary
point for this example
Consider the case when all couples in the sample achieve pregnancy
on their first cycle
Then y = n and the log-likelihood reduces to
ℓ(p) = n log(p)
This is plotted on the next slide, where the MLE is clearly p̂ = 1
Is this MLE a sensible estimate?
[Figure: the log-likelihood ℓ(p) = n log(p) plotted against p; it increases over the whole interval, so the maximum occurs at the boundary value p̂ = 1]
Computation of the MLE
In the previous example we solved the likelihood equation to get an
expression for the MLE that can be evaluated
When this is possible we say that the MLE is explicit
In more complicated situations we may be able to write down the
likelihood equation, but it may not be possible to solve it algebraically
In that case the MLE is implicit and we use iterative algorithms to
solve the likelihood equation numerically
The most common examples of these techniques are the so-called
Fisher Scoring and Newton-Raphson algorithms
Sometimes it may not even be possible to write down the derivative of the log-likelihood. In that case one can sometimes use "derivative-free" iterative algorithms to compute the MLE
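As an illustration (not from the slides), a minimal Newton–Raphson iteration for the geometric score equation above; Fisher scoring would replace the observed information in the update by the expected information:

n, y = 2, 4                                     # data: 2 couples, 4 total attempts
p = 0.2                                         # starting value inside (0, 1)
for _ in range(20):
    score = n / p - (y - n) / (1 - p)           # d l(p) / dp
    obs_info = n / p**2 + (y - n) / (1 - p)**2  # I_O(p) = -d^2 l(p) / dp^2
    p = p + score / obs_info                    # Newton-Raphson update
print(p)                                        # converges to n / y = 0.5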
Likelihood curvature
We will be more confident about the MLE as an estimate if there is
only a very narrow range of parameter values that achieve a likelihood
value close to the MLE
If there is a wide range of parameter values that achieve almost as
high a likelihood as the MLE we would be less confident in the MLE
as an estimate
flat likelihood = not confident in MLE
pointy likelihood = confident in the MLE
Geometric fertility example: (n, y) = (2, 4) and (n, y) = (80, 160)
[Figure: likelihood functions for the geometric fertility example with (n, y) = (80, 160) and (n, y) = (2, 4); both peak at p̂ = 0.5, but the n = 80 likelihood is much more sharply peaked]
Log-likelihood second derivative
The second derivative of a function tells us the rate at which the
gradient is changing
In the vicinity of the MLE, where a log-likelihood function achieves its
maximum, the second derivative is negative
If the gradient is reducing at a very slow rate the log-likelihood will be flat – second derivative slightly less than zero
If the gradient is reducing at a very fast rate the log-likelihood will be pointy – second derivative much less than zero
The extent to which the second derivative of the log-likelihood
function is less than zero is therefore a measure of the log-likelihood
curvature
The magnitude of the log-likelihood second derivative at its maximum
is a measure of our confidence in the MLE
Information
Consider the negative of the second derivative of the log-likelihood function,
I_O(θ) = −d²ℓ(θ)/dθ²
We call I_O(θ) the observed information at the parameter value θ
When evaluated at the MLE, θ = θ̂, the observed information quantifies the curvature of the log-likelihood function at its maximum
When I_O(θ̂) is "large" we will have high confidence in the MLE as an estimate of the parameter, and when I_O(θ̂) is closer to zero we will have less confidence
I_O(θ̂) will be large when the data are very informative about the parameter, and will be small when the data are less informative
Information and standard errors
Information is more than just a measure of confidence in the MLE
The expectation of the observed information is referred to as the
expected information
The expected information is defined as
I(θ) = E[I_O(θ; Y)]
It is important because it quantifies the MLE sampling variability:
Var(Θ̂) ≈ I(θ)⁻¹ when n is large
A standard error based on the MLE in large samples is
SE(Θ̂) ≈ √(I(θ̂)⁻¹)
Properties of the MLE
In general the MLE is not unbiased, although in some cases it is
For example, for the normal distribution, the sample mean is an unbiased MLE for µ, but S_n² (with divisor n) is a biased MLE of σ²
Although the MLE is not unbiased in general, it is a consistent estimator
The asymptotic distribution of the MLE is a normal distribution,
which is desirable for calculating confidence intervals
In particular, we have the following convergence in distribution property for an MLE:
√I(θ) (Θ̂ − θ) →_d N(0, 1)
So the sampling distribution of the MLE is approximately normal:
Θ̂ ≈_d N(θ, I(θ)⁻¹) when n is large
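A small simulation sketch (not from the slides) illustrating this approximation for the geometric model, taking p = 0.5 and n = 80 as in the earlier example:

import numpy as np

rng = np.random.default_rng(0)
p_true, n, B = 0.5, 80, 10000
p_hats = np.empty(B)
for b in range(B):
    y = rng.geometric(p_true, size=n)   # cycles needed by each of the n couples
    p_hats[b] = n / y.sum()             # MLE p_hat = n / (total attempts)
print(p_hats.std())                              # simulated SD of the MLE ...
print(np.sqrt(p_true**2 * (1 - p_true) / n))     # ... close to sqrt(I(p)^-1)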
Properties of the MLE
When the asymptotic variance of an estimator depends on the
parameter we approximate the variance by using an estimate
This leads to a more useful approximate distribution for the MLE
Θ̂ ≈_d N(θ, I(θ̂)⁻¹) when n is large
Confidence intervals for the parameter can then be calculated using
θ̂ ± z_{1−α/2} √(I(θ̂)⁻¹)
where z_x is the x-percentile of the standard normal distribution (e.g. 1.96 for a 95% confidence interval)
Properties of the MLE
The MLE is usually considered the "best" approach to estimation because it holds an optimality property
Recall the minimum variance bound from Lecture 8. Since the log of the probability (density) of the sample is identical to the log-likelihood function ℓ(θ), the minimum possible asymptotic variance is I(θ)⁻¹
The MLE therefore has an asymptotic efficiency of 1, or is simply
asymptotically efficient
Although the MLE has desirable properties most of the time, these
properties do not hold in every situation
When the MLE is on the "boundary" of the parameter space the above asymptotic properties don't hold, e.g. the geometric example with p̂ = 1
Another important exception is (high-dimensional) multiple parameter
situations as discussed later
Example: properties of the MLE
In the geometric fertility example we had log-likelihood derivative
dℓ(p)/dp = n/p − (y − n)/(1 − p)
Differentiating again gives the observed information
I_O(p; y) = −d²ℓ(p)/dp² = n/p² + (y − n)/(1 − p)²
This gives the expected information
I(p) = E[I_O(p; Y)] = n/p² + (E(Y) − n)/(1 − p)²
Example: properties of the MLE
Using E(Y_i) = 1/p for the geometric distribution, we have E(Y) = Σ_{i=1}^n E(Y_i) = n/p. Thus,
I(p) = n / (p²(1 − p))
The 95% confidence interval is
n/y ± 1.96 √(p̂²(1 − p̂)/n)
For the two scenarios plotted earlier, with p̂ = 0.5:
- n = 2: confidence interval is (0.01, 0.99)
- n = 80: confidence interval is (0.42, 0.58)
How do the results compare with our earlier log-likelihood plots?
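These intervals can be reproduced directly; a sketch for the two scenarios above:

import numpy as np

for n, y in [(2, 4), (80, 160)]:
    p_hat = n / y                               # MLE
    se = np.sqrt(p_hat**2 * (1 - p_hat) / n)    # sqrt(I(p_hat)^-1)
    print(n, p_hat - 1.96 * se, p_hat + 1.96 * se)   # (0.01, 0.99) and (0.42, 0.58)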
Parameter transformations
Sometimes we may be more interested in some function of a
parameter, rather than the original parameter itself
For example, the odds of an event θ/(1 − θ), rather than the probability of an event θ
For any one-to-one transformation of the parameter g(θ), the MLE of the transformed parameter is g(θ̂)
For example, the MLE of the odds is θ̂/(1 − θ̂), where θ̂ is the MLE of θ
This property of the MLE is called equivariance
Equivariance means that we do not need to do a separate likelihood
maximisation if we change the parameter scale
Parameter transformations
Computation of the MLE of a transformed parameter g(θ) is
straightforward if we already have the MLE of θ
Computation of an associated standard error may not be so easy
There is a general method for computing approximate standard errors associated with g(θ̂), assuming that we already have se(θ̂)
By using a series expansion of the function g(θ) we obtain the approximate variance
Var[g(Θ̂)] ≈ g′(θ)² Var(Θ̂)
This gives the approximate standard error
se[g(θ̂)] ≈ |g′(θ̂)| se(θ̂)
This approximate method is called the Delta-method
Example: Delta-method
Consider Y ∼ Bin(n, θ) with MLE of θ given by
θ̂ = y/n
Sometimes we are more interested in the log-odds
g(θ) = log(θ / (1 − θ))
One reason for this is that asymptotic normality is a better
approximation on the log-odds scale so a confidence interval for the
log-odds g(θ) will be more accurate than a confidence interval for the
probability θ
Example: Delta-method
Differentiating g(θ) gives
g′(θ) = 1 / (θ(1 − θ))
By the delta-method, this gives an approximate standard error which can be used for constructing confidence intervals for the log-odds:
se[g(θ̂)] ≈ [1 / (θ̂(1 − θ̂))] √(θ̂(1 − θ̂)/n) = √(1 / (n θ̂(1 − θ̂)))
Why would asymptotic normality be a more accurate approximation
for estimating the log-odds g(θ) than for estimating the probability θ?
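A sketch of the delta-method calculation (using hypothetical data, say y = 30 successes out of n = 100 trials), including back-transforming the log-odds interval to the probability scale:

import numpy as np

n, y = 100, 30                                       # hypothetical binomial data
theta_hat = y / n                                    # MLE of theta
g_hat = np.log(theta_hat / (1 - theta_hat))          # MLE of the log-odds (equivariance)
se_g = np.sqrt(1 / (n * theta_hat * (1 - theta_hat)))  # delta-method standard error
lo, hi = g_hat - 1.96 * se_g, g_hat + 1.96 * se_g
print(lo, hi)                                        # 95% CI for the log-odds
print(1 / (1 + np.exp(-lo)), 1 / (1 + np.exp(-hi)))  # back-transformed CI for theta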
Multiple parameters
If the parameter is a vector θ = (θ_1, ..., θ_p) then the log-likelihood is a multivariable function ℓ(θ) = ℓ(θ_1, ..., θ_p)
Instead of a single likelihood equation, we now have a collection of p
likelihood equations or score equations
∂ℓ(θ_1, ..., θ_p)/∂θ_1 = 0
...
∂ℓ(θ_1, ..., θ_p)/∂θ_p = 0
Collection of p simultaneous equations
MLE: the solution θ̂ = (θ̂_1, ..., θ̂_p) is the maximiser of the multivariable function ℓ(θ)
Multiple parameters
With multiple parameters the MLE is consistent and asymptotically
multivariate normal
Let
I_ij = −∂²ℓ(θ_1, ..., θ_p) / ∂θ_i ∂θ_j
The observed information matrix is the p × p matrix I_O(θ) = [I_ij]
The expected information matrix is the p × p matrix I(θ) = E[I_O(θ)]
Approximate multivariate normal distribution:
Θ̂ ≈_d N_p(θ, I(θ̂)⁻¹) when n is large
Multiple parameters
Since θ is a p-dimensional vector, the large-sample distribution of the MLE is a p-dimensional multivariate normal distribution
The inverse of the expected information, I(θ)-1, is now a matrix
inverse rather than a reciprocal
Standard errors associated with each θ̂_i are obtained from the corresponding diagonal elements of the inverse of the information matrix, after substituting in the parameter estimate θ̂
The standard error associated with θ̂_i is the square root of the (i, i) element of I(θ̂)⁻¹
Exception to consistency and approximate normality: when the number of parameters increases as n increases, consistency and asymptotic normality may not hold
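A sketch (not from the slides) of these calculations for a two-parameter case, assuming a N(µ, σ²) sample; here the MLEs are explicit and the expected information matrix is diag(n/σ², n/(2σ⁴)):

import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=10.0, scale=2.0, size=200)   # simulated sample, n = 200
n = y.size

mu_hat = y.mean()                               # MLE of mu
sig2_hat = ((y - mu_hat) ** 2).mean()           # MLE of sigma^2 (divisor n)

info = np.array([[n / sig2_hat, 0.0],           # expected information matrix
                 [0.0, n / (2 * sig2_hat ** 2)]])
se = np.sqrt(np.diag(np.linalg.inv(info)))      # SEs from diagonal of I(theta_hat)^-1
print(mu_hat, sig2_hat, se)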
Further estimation methods
Maximum likelihood is the most common and most generally
applicable method of constructing estimators for parametric models
In some specific cases we may consider other methods:
1 When the model is very complicated, the likelihood function may be
difficult to write down and evaluate, so it may be difficult to
numerically maximise
2 For some specific parametric models maximum likelihood estimation
takes a particular form that can be used more generally for other
models, even though it may not lead to the MLE
3 The MLE is asymptotically optimal but it is usually biased in small
samples. Sometimes it is possible to identify an optimal unbiased
estimator
We briefly consider estimation methods dealing with these 3 scenarios
Method of moments
Recall that the kth moment of a random variable Y is
µ_k = E(Y^k)
Similarly, the kth sample moment of a sample y = (y_1, ..., y_n) is
m_k = (1/n) Σ_{i=1}^n y_i^k
Method of moments: for a model with p parameters θ = (θ_1, ..., θ_p), construct p simultaneous equations and solve for θ:
µ_1(θ) = m_1
...
µ_p(θ) = m_p
Method of moments
Sometimes method of moments estimators are the same as the MLE
for simple models, but often they are different
Method of moments estimators tend to be easier to derive than MLEs
but they tend to be inefficient compared to MLEs
For the sort of models that we consider in this unit, method of
moments estimation is rarely used
Exercise 1: derive the method of moments estimators for µ and σ² in a N(µ, σ²) distribution and compare with the MLE
Exercise 2: derive the MLE and the method of moments estimator for a Uniform(0, θ) distribution and compare.
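For illustration (a sketch using a Gamma(shape k, scale s) model, which is not one of the exercises above), matching µ_1 = ks and µ_2 = ks² + (ks)² to the sample moments gives ŝ = (m_2 − m_1²)/m_1 and k̂ = m_1²/(m_2 − m_1²):

import numpy as np

rng = np.random.default_rng(2)
y = rng.gamma(shape=3.0, scale=2.0, size=500)   # simulated Gamma(k = 3, s = 2) sample

m1 = y.mean()              # first sample moment
m2 = (y ** 2).mean()       # second sample moment

s_hat = (m2 - m1 ** 2) / m1          # method of moments estimate of the scale
k_hat = m1 ** 2 / (m2 - m1 ** 2)     # method of moments estimate of the shape
print(k_hat, s_hat)                  # roughly (3, 2)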
Least squares estimation
When Y_i ∼ N(µ, σ²), maximising the log-likelihood as a function of µ is equivalent to minimising the sum of squares:
minimise SS(µ) = Σ_{i=1}^n (y_i − µ)²
So the MLE of µ can be viewed as a least squares (LS) estimator
This interpretation extends conveniently to more complex models
Consider p covariates x_i = (x_1i, ..., x_pi) for each observation y_i
The standard regression model for relating µ_i = E(Y_i) to x_i is
Y_i ∼ N(µ_i, σ²) where µ_i(θ) = θ_0 + θ_1 x_1i + ··· + θ_p x_pi
or in matrix form: µ = Xθ
Least squares estimation
Regression models, and other linear models, are estimated using LS estimation, which is equivalent to MLE when Y_i is normal:
minimise SS(θ) = SS(θ_0, ..., θ_p) = Σ_{i=1}^n (y_i − µ_i(θ))²
The linear structure of µ_i(θ) makes this LS minimisation equivalent to solving the following linear equations:
Xᵀy = XᵀXθ, giving θ̂ = (XᵀX)⁻¹ Xᵀy
In principle LS estimation can also be used for non-normal models
In non-normal models LS estimation will usually be simpler than ML
estimation, but may be inefficient
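A sketch of the normal-equations solution (simulated data with an intercept and one covariate; the true coefficients here are assumed for illustration):

import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x1])            # design matrix with intercept column
y = 1.0 + 0.5 * x1 + rng.normal(0, 1, size=n)    # data generated with theta = (1.0, 0.5)

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # solve the normal equations X'X theta = X'y
print(theta_hat)                                 # close to (1.0, 0.5)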
Minimum variance unbiased estimation
We have already discussed how to test whether a given unbiased
estimator attains the minimum variance bound
An important result from theoretical statistics says that the minimum
variance unbiased estimator (MVUE) is a function of the minimal
sufficient statistic (with some additional theoretical conditions that
are satisfied for the sorts of models considered in this unit)
This leads to an estimation method:
- Find the minimal sufficient statistic T(Y)
- Find some function H such that E[H(T(Y))] = θ
- Then H(T(Y)) is the MVUE
In practice this cannot always be done and it is of largely theoretical
interest only
In practice we usually use the MLE for parametric models