STAT6110/STAT3110

Statistical Inference

Topic 5 – Estimation methods

Jun Ma

Topic 5

Semester 1, 2021

Jun Ma (Topic 5) STAT6110/STAT3110 Statistical Inference Semester 1, 2021 1 / 38

Topic 5 Outline: Estimation methods

Maximum likelihood estimation

Computation of the MLE

Information and standard errors

Properties of the MLE

Parameter transformations

Multiple parameters

Further estimation methods

Maximum likelihood estimation

Consider a sample y = (y1, …, yn) and likelihood function

L(θ) = L(θ; y)

The most supported parameter value θ̂ will be the value for which

L(θ̂) > L(θ′) whenever θ′ ≠ θ̂

The corresponding estimate, θ̂ = Θ̂(y), is called the maximum likelihood estimate

It provides the parameter value that makes the observed sample the most likely sample among all possible samples

Like any estimate, it is the observed value of a random variable, the maximum likelihood estimator Θ̂ = Θ̂(Y)

The abbreviation MLE is used interchangeably for both

Maximum likelihood estimation

Previously we relied on being able to choose estimates that made intuitive sense, e.g. the sample mean to estimate the population mean

Often it is not possible to find an intuitively obvious estimate

Maximum likelihood is a general method of estimation

It can be applied in any situation where we can write down a likelihood function

It provides a general criterion on which to base our estimate: the most supported parameter value

MLE: definition

An MLE is any value θ̂ that is in the parameter space and for which L(θ̂) ≥ L(θ′) for any other parameter value θ′ that is also in the parameter space

An MLE maximises the function L(θ) over the parameter space

If a particular parameter value maximises the likelihood function then it will also maximise the log-likelihood function

Equivalently, an MLE maximises the function ℓ(θ) = log L(θ) over the parameter space

This is often used for computation because ℓ(θ) is usually a more mathematically convenient function

MLE: complications

Uniqueness:

- Usually there is only one unique maximum of the likelihood function
- In some complicated contexts there may be two or more parameter values that achieve the maximum likelihood value
- Then we say that the MLE is not unique
- This is a problem when it arises, because we may not know which maximum is the appropriate estimate
- In most contexts of interest the MLE is unique and this potential complication does not arise

Parameter space:

- When finding the MLE we must consider only those values of the parameter that are within the parameter space, even if the function L(θ) can be evaluated outside the parameter space

MLE: example

Recall the fertility example (geometric distribution); for now we assume just one group

The sufficient statistic is y = Σ_i y_i (total attempts by all couples)

The likelihood function for p (success probability in one cycle) is

L(p) = p^n (1 - p)^(y - n)

Since p is a probability, the parameter space is the interval [0, 1]

The MLE is defined as the maximum of L(p) over [0, 1]

Note that it is possible to evaluate the function L(p) for values of p outside this interval

Values of p outside [0, 1] must be ignored in determining the MLE

MLE: example

Consider a sample in which n = 2 couples take a total of y = 4 attempts to achieve their first pregnancy

The corresponding likelihood function is

L(p) = p^2 (1 - p)^2

This is plotted on the next slide

The maximum of the likelihood function over the parameter space occurs at the MLE p̂ = 0.5, corresponding to an estimated 50% probability of pregnancy in any given cycle

Notice that L(p) may achieve higher values outside the parameter space; however, these are of no relevance for determining the MLE
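As a quick numerical check (a sketch, not part of the original example), the likelihood L(p) = p^2 (1 - p)^2 can be evaluated over a fine grid on the parameter space [0, 1] to confirm that the maximum occurs at p̂ = 0.5:

```python
import numpy as np

def likelihood(p, n=2, y=4):
    """Geometric-model likelihood L(p) = p^n (1 - p)^(y - n)."""
    return p**n * (1 - p)**(y - n)

# Evaluate over a fine grid restricted to the parameter space [0, 1]
grid = np.linspace(0.0, 1.0, 100001)
p_hat = grid[np.argmax(likelihood(grid))]
# p_hat is 0.5, matching the MLE read off the plot
```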

[Figure: likelihood function L(p) = p^2 (1 - p)^2 plotted against p, with the parameter space [0, 1] and the MLE marked]

Computation of the MLE

We will start by considering a one-dimensional parameter and generalise later

Calculus is often a useful tool for computing the MLE

In many contexts, the maximum value of the likelihood function, or equivalently the log-likelihood function, occurs at a point where its derivative is zero: a stationary point

Then the MLE solves an equation obtained by setting the derivative of the log-likelihood function (with respect to the parameter θ) equal to zero

This leads to the so-called likelihood equation, or score equation,

dℓ(θ)/dθ = 0

Computation of the MLE

The maximum of the likelihood function may not occur at a stationary point

This is less common but worth being aware of, and it may make it necessary to inspect the log-likelihood function visually by plotting it

In practice there exist mathematical criteria that can be used to establish whether or not the MLE is the solution of the likelihood equation

If the log-likelihood function is concave, meaning that its second derivative is less than zero for all values of θ, then any solution of the likelihood equation is the unique MLE

Example: computation of MLE

The log-likelihood function for the geometric fertility example is

ℓ(p) = n log(p) + (y - n) log(1 - p)

which has derivative

dℓ(p)/dp = n/p - (y - n)/(1 - p)

Setting this equal to zero and solving leads to the MLE of p:

p̂ = n/y

If there are n = 2 couples and y = 4 pregnancy attempts, then the MLE is p̂ = 0.5, as before
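A small sanity check (a sketch, not part of the slides): the explicit MLE p̂ = n/y should make the score function exactly zero, for any (n, y):

```python
def score(p, n, y):
    """Score function dl/dp = n/p - (y - n)/(1 - p) for the geometric model."""
    return n / p - (y - n) / (1 - p)

# The explicit MLE p_hat = n/y zeroes the score (up to floating-point error)
for n, y in [(2, 4), (80, 160), (3, 7)]:
    p_hat = n / y
    assert abs(score(p_hat, n, y)) < 1e-9
```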

Example: computation of MLE

The previous approach is only appropriate when the maximum of the log-likelihood is a stationary point

The maximum of the log-likelihood may not occur at a stationary point for this example

Consider the case when all couples in the sample achieve pregnancy on their first cycle

Then y = n and the log-likelihood reduces to

ℓ(p) = n log(p)

This is plotted on the next slide, where the MLE is clearly p̂ = 1

Is this MLE a sensible estimate?

[Figure: log-likelihood ℓ(p) = n log(p) plotted against p, increasing towards its maximum at p = 1]

Computation of the MLE

In the previous example we solved the likelihood equation to get an expression for the MLE that can be evaluated directly

When this is possible we say that the MLE is explicit

In more complicated situations we may be able to write down the likelihood equation, but it may not be possible to solve it algebraically

In that case the MLE is implicit and we use iterative algorithms to solve the likelihood equation numerically

The most common examples of these techniques are the so-called Fisher scoring and Newton-Raphson algorithms

Sometimes it may not even be possible to write down the derivative of the log-likelihood. In that case one can sometimes use "derivative-free" iterative algorithms to compute the MLE
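As an illustration (a sketch; the slides do not give an algorithm), Newton-Raphson applied to the geometric score equation converges to the explicit MLE n/y, so the iterative and algebraic answers can be checked against each other:

```python
def newton_raphson(score, score_deriv, theta0, tol=1e-10, max_iter=100):
    """Solve score(theta) = 0 by iterating theta <- theta - score/score'."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / score_deriv(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Geometric fertility example with n = 2 couples and y = 4 attempts
n, y = 2, 4
score = lambda p: n / p - (y - n) / (1 - p)               # dl/dp
score_deriv = lambda p: -n / p**2 - (y - n) / (1 - p)**2  # d2l/dp2
p_hat = newton_raphson(score, score_deriv, theta0=0.3)
# converges to the explicit MLE n/y = 0.5
```

In practice the starting value matters: for a likelihood that is not concave everywhere, a poor start can send the iteration outside the parameter space.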

Likelihood curvature

We will be more confident in the MLE as an estimate if only a very narrow range of parameter values achieves a likelihood value close to that of the MLE

If a wide range of parameter values achieves almost as high a likelihood as the MLE, we will be less confident in the MLE as an estimate

Flat likelihood = not confident in the MLE

Pointy likelihood = confident in the MLE

Geometric fertility example: (n, y) = (2, 4) and (n, y) = (80, 160)

[Figure: likelihood functions for the geometric fertility example, one panel with n = 80 (sharply peaked) and one with n = 2 (flat), both maximised at p̂ = 0.5]

Log-likelihood second derivative

The second derivative of a function tells us the rate at which the gradient is changing

In the vicinity of the MLE, where the log-likelihood function achieves its maximum, the second derivative is negative

If the gradient is decreasing at a very slow rate the log-likelihood will be flat: second derivative slightly less than zero

If the gradient is decreasing at a very fast rate the log-likelihood will be pointy: second derivative much less than zero

The extent to which the second derivative of the log-likelihood function is less than zero is therefore a measure of the log-likelihood curvature

The magnitude of the log-likelihood second derivative at its maximum is a measure of our confidence in the MLE

Information

Consider the negative of the second derivative of the log-likelihood function,

I_O(θ) = -d²ℓ(θ)/dθ²

We call I_O(θ) the observed information at the parameter value θ

When evaluated at the MLE, θ = θ̂, the observed information quantifies the curvature of the log-likelihood function at its maximum

When I_O(θ̂) is "large" we will have high confidence in the MLE as an estimate of the parameter, and when I_O(θ̂) is closer to zero we will have less confidence

I_O(θ̂) will be large when the data are very informative about the parameter, and will be small when the data are less informative
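To make the definition concrete (a numerical sketch, not from the slides), the observed information can be approximated by a finite-difference second derivative of the log-likelihood; for the geometric example with n = 2, y = 4, the analytic value at p̂ = 0.5 is n/p² + (y - n)/(1 - p)² = 16:

```python
import math

def loglik(p, n=2, y=4):
    """Geometric log-likelihood l(p) = n log p + (y - n) log(1 - p)."""
    return n * math.log(p) + (y - n) * math.log(1 - p)

def observed_info(p, h=1e-4):
    """I_O(p) = -l''(p), approximated by a central finite difference."""
    return -(loglik(p + h) - 2 * loglik(p) + loglik(p - h)) / h**2

# At the MLE p = 0.5 the analytic observed information is 16
approx = observed_info(0.5)
```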

Information and standard errors

Information is more than just a measure of confidence in the MLE

The expectation of the observed information is referred to as the expected information

The expected information is defined as

I(θ) = E[I_O(θ; Y)]

It is important because it quantifies the MLE sampling variability:

Var(Θ̂) ≈ I(θ)^(-1) when n is large

A standard error for the MLE in large samples is

se(Θ̂) ≈ √(I(θ̂)^(-1))

Properties of the MLE

In general the MLE is not unbiased, although it may be in particular cases

For example, for the normal distribution, the sample mean is an unbiased MLE of µ, but S_n² is a biased MLE of σ²

While the MLE is not unbiased, it is a consistent estimator

The asymptotic distribution of the MLE is a normal distribution, which is desirable for calculating confidence intervals

In particular, we have the following convergence in distribution property for an MLE:

√(I(θ)) (Θ̂ - θ) →d N(0, 1)

So the sampling distribution of the MLE is approximately normal:

Θ̂ ≈d N(θ, I(θ)^(-1)) when n is large


Properties of the MLE

When the asymptotic variance of an estimator depends on the parameter, we approximate the variance by using an estimate

This leads to a more useful approximate distribution for the MLE:

Θ̂ ≈d N(θ, I(θ̂)^(-1)) when n is large

Confidence intervals for the parameter can then be calculated using

θ̂ ± z_{1-α/2} √(I(θ̂)^(-1))

where z_x is the x-quantile of the standard normal distribution (e.g. z_{0.975} = 1.96 for a 95% confidence interval)

Properties of the MLE

The MLE is usually considered the "best" approach to estimation because it holds an optimality property

Recall the minimum variance bound from Lecture 8. Since the log of the probability (density) of the sample is identical to the log-likelihood function ℓ(θ), the minimum possible asymptotic variance is I(θ)^(-1)

The MLE therefore has an asymptotic efficiency of 1, or is simply asymptotically efficient

Although the MLE has desirable properties most of the time, these properties do not hold in every situation

When the MLE is on the "boundary" of the parameter space the above asymptotic properties don't hold, e.g. the geometric example with p̂ = 1

Another important exception is (high-dimensional) multiple parameter situations, as discussed later

Example: properties of the MLE

In the geometric fertility example we had log-likelihood derivative

dℓ(p)/dp = n/p - (y - n)/(1 - p)

Differentiating again gives the observed information

I_O(p; y) = -d²ℓ(p)/dp² = n/p² + (y - n)/(1 - p)²

This gives the expected information

I(p) = E[I_O(p; Y)] = n/p² + (E(Y) - n)/(1 - p)²

Example: properties of the MLE

Using E(Y_i) = 1/p for the geometric distribution, we have E(Y) = Σ_{i=1}^n E(Y_i) = n/p. Thus,

I(p) = n / (p²(1 - p))

The 95% confidence interval is

n/y ± 1.96 √(p̂²(1 - p̂)/n)

For the two scenarios plotted earlier, with p̂ = 0.5:

- n = 2: confidence interval is (0.01, 0.99)
- n = 80: confidence interval is (0.42, 0.58)

How do the results compare with our earlier log-likelihood plots?
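The two intervals quoted above can be reproduced directly (a sketch based on the formula on this slide):

```python
import math

def wald_ci_geometric(n, y, z=1.96):
    """95% Wald interval: p_hat +/- z * sqrt(p_hat^2 (1 - p_hat) / n)."""
    p_hat = n / y
    se = math.sqrt(p_hat**2 * (1 - p_hat) / n)
    return (p_hat - z * se, p_hat + z * se)

lo2, hi2 = wald_ci_geometric(2, 4)       # wide interval, roughly (0.01, 0.99)
lo80, hi80 = wald_ci_geometric(80, 160)  # narrow interval, roughly (0.42, 0.58)
```

The wide interval for n = 2 matches the flat likelihood, and the narrow interval for n = 80 matches the sharply peaked one.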

Parameter transformations

Sometimes we may be more interested in some function of a parameter, rather than the original parameter itself

For example, the odds of an event, θ/(1 - θ), rather than the probability of an event, θ

For any one-to-one transformation g(θ) of the parameter, the MLE of the transformed parameter is g(θ̂)

For example, the MLE of the odds is θ̂/(1 - θ̂), where θ̂ is the MLE of θ

This property of the MLE is called equivariance

Equivariance means that we do not need to do a separate likelihood maximisation if we change the parameter scale

Parameter transformations

Computation of the MLE of a transformed parameter g(θ) is straightforward if we already have the MLE of θ

Computation of an associated standard error may not be so easy

There is a general method for computing approximate standard errors associated with g(θ̂), assuming that we already have se(θ̂)

By using a series expansion of the function g(θ) we obtain the approximate variance

Var[g(Θ̂)] ≈ g′(θ)² Var(Θ̂)

This gives the approximate standard error

se[g(θ̂)] ≈ |g′(θ̂)| se(θ̂)

This approximate method is called the Delta-method
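The Delta-method formula translates directly into code. A minimal sketch (the estimate and standard error below are hypothetical numbers chosen only for illustration) using the odds transformation g(θ) = θ/(1 - θ), whose derivative is g′(θ) = 1/(1 - θ)²:

```python
def delta_method_se(g_prime, theta_hat, se_theta):
    """se[g(theta_hat)] ~ |g'(theta_hat)| * se(theta_hat)."""
    return abs(g_prime(theta_hat)) * se_theta

# Odds transformation: g(theta) = theta/(1 - theta), g'(theta) = 1/(1 - theta)^2
g_prime = lambda t: 1.0 / (1.0 - t)**2

# Hypothetical estimate theta_hat = 0.5 with standard error 0.25
se_odds = delta_method_se(g_prime, theta_hat=0.5, se_theta=0.25)
# g'(0.5) = 4, so se_odds = 4 * 0.25 = 1.0
```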

Example: Delta-method

Consider Y ∼ Bin(n, θ), with MLE of θ

θ̂ = y/n

Sometimes we are more interested in the log-odds

g(θ) = log(θ/(1 - θ))

One reason for this is that asymptotic normality is a better approximation on the log-odds scale, so a confidence interval for the log-odds g(θ) will be more accurate than a confidence interval for the probability θ

Example: Delta-method

Differentiating g(θ) gives

g′(θ) = 1 / (θ(1 - θ))

This gives an approximate standard error, which can be used for constructing confidence intervals for the log-odds by the Delta-method:

se[g(θ̂)] ≈ (1/(θ̂(1 - θ̂))) √(θ̂(1 - θ̂)/n) = √(1/(n θ̂(1 - θ̂)))

Why would asymptotic normality be a more accurate approximation for estimating the log-odds g(θ) than for estimating the probability θ?
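As a check on the algebra above (a sketch with hypothetical binomial counts y = 40, n = 100), the two expressions for the log-odds standard error should agree:

```python
import math

y, n = 40, 100          # hypothetical binomial data
theta_hat = y / n

# Left-hand form: g'(theta_hat) * se(theta_hat), with se = sqrt(theta(1-theta)/n)
lhs = (1 / (theta_hat * (1 - theta_hat))) * math.sqrt(theta_hat * (1 - theta_hat) / n)

# Simplified form: sqrt(1 / (n * theta_hat * (1 - theta_hat)))
rhs = math.sqrt(1 / (n * theta_hat * (1 - theta_hat)))

assert abs(lhs - rhs) < 1e-12
```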

Multiple parameters

If the parameter is a vector θ = (θ1, …, θp) then the log-likelihood is a multivariable function ℓ(θ) = ℓ(θ1, …, θp)

Instead of a single likelihood equation, we now have a collection of p likelihood equations, or score equations:

∂ℓ(θ1, …, θp)/∂θ1 = 0
⋮
∂ℓ(θ1, …, θp)/∂θp = 0

This is a collection of p simultaneous equations

MLE: the solution θ̂ = (θ̂1, …, θ̂p) is the maximum of the multivariable function ℓ(θ)
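As a concrete illustration (a sketch with made-up data, not from the slides), for the N(µ, σ²) model the closed-form MLEs µ̂ = ȳ and σ̂² = (1/n)Σ(y_i - ȳ)² solve both score equations simultaneously:

```python
import numpy as np

y = np.array([1.2, 0.7, 2.1, 1.5, 0.9])  # illustrative data
n = len(y)

mu_hat = y.mean()                     # solves dl/dmu = 0
sig2_hat = ((y - mu_hat)**2).mean()   # solves dl/dsigma2 = 0 (the biased MLE)

# The two score equations for the N(mu, sigma2) log-likelihood
score_mu = np.sum(y - mu_hat) / sig2_hat
score_sig2 = -n / (2 * sig2_hat) + np.sum((y - mu_hat)**2) / (2 * sig2_hat**2)
# Both are (numerically) zero at the MLE
```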

Multiple parameters

With multiple parameters the MLE is consistent and asymptotically multivariate normal

Let

I_ij = -∂²ℓ(θ1, …, θp)/∂θi ∂θj

The observed information matrix is the p × p matrix

I_O(θ) = [I_ij]

The expected information matrix is the p × p matrix

I(θ) = E[I_O(θ)]

Approximate multivariate normal distribution:

Θ̂ ≈d N_p(θ, I(θ̂)^(-1)) when n is large

Multiple parameters

Since θ is a p-dimensional vector, the large-sample distribution of the MLE is a p-dimensional multivariate normal distribution

The inverse of the expected information, I(θ)^(-1), is now a matrix inverse rather than a reciprocal

Standard errors associated with each θ̂_i are obtained from the corresponding diagonal elements of the inverse of the information matrix, after substituting in the parameter estimate θ̂

The standard error associated with θ̂_i is the square root of the (i, i) element of I(θ̂)^(-1)

Exception to consistency and approximate normality: when the number of parameters increases as n increases, consistency and asymptotic normality may not hold

Further estimation methods

Maximum likelihood is the most common and most generally applicable method of constructing estimators for parametric models

In some specific cases we may consider other methods:

1. When the model is very complicated, the likelihood function may be difficult to write down and evaluate, so it may be difficult to maximise numerically

2. For some specific parametric models, maximum likelihood estimation takes a particular form that can be used more generally for other models, even though it may not lead to the MLE

3. The MLE is asymptotically optimal but it is usually biased in small samples. Sometimes it is possible to identify an optimal unbiased estimator

We briefly consider estimation methods dealing with these three scenarios

Method of moments

Recall that the kth moment of a random variable Y is

µ_k = E(Y^k)

Similarly, the kth sample moment of a sample y = (y1, …, yn) is

m_k = (1/n) Σ_{i=1}^n y_i^k

Method of moments: for a model with p parameters θ = (θ1, …, θp), construct p simultaneous equations and solve for θ:

µ_1(θ) = m_1
⋮
µ_p(θ) = m_p
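A minimal sketch of the method for the N(µ, σ²) model (with made-up data; this also previews Exercise 1 below): using E(Y) = µ and E(Y²) = µ² + σ², solve µ = m₁ and µ² + σ² = m₂:

```python
import numpy as np

y = np.array([1.2, 0.7, 2.1, 1.5, 0.9])  # illustrative data

m1 = y.mean()        # first sample moment
m2 = (y**2).mean()   # second sample moment

# Moment equations for N(mu, sigma2): mu = m1 and mu^2 + sigma2 = m2
mu_mom = m1
sig2_mom = m2 - m1**2

# For the normal, these coincide with the MLEs (sample mean and biased variance)
```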

Method of moments

Sometimes method of moments estimators are the same as the MLE for simple models, but often they are different

Method of moments estimators tend to be easier to derive than MLEs, but they also tend to be inefficient compared to MLEs

For the sort of models that we consider in this unit, method of moments estimation is rarely used

Exercise 1: derive the method of moments estimators for µ and σ² in a N(µ, σ²) distribution and compare with the MLE

Exercise 2: derive the MLE and the method of moments estimator for a Uniform(0, θ) distribution and compare

Least squares estimation

When Y_i ∼ N(µ, σ²), maximising the log-likelihood as a function of µ is equivalent to minimising the sum of squares:

minimise SS(µ) = Σ_{i=1}^n (y_i - µ)²

So the MLE of µ can be viewed as a least squares (LS) estimator

This interpretation extends conveniently to more complex models

Consider p covariates x_i = (x_{1i}, …, x_{pi}) for each observation y_i

The standard regression model relating µ_i = E(Y_i) to x_i is

Y_i ∼ N(µ_i, σ²) where µ_i(θ) = θ_0 + θ_1 x_{1i} + ⋯ + θ_p x_{pi}

or in matrix form: µ = Xθ

Least squares estimation

Regression models, and other linear models, are estimated using LS estimation, which is equivalent to MLE when Y_i is normal:

minimise SS(θ) = SS(θ_0, …, θ_p) = Σ_{i=1}^n (y_i - µ_i(θ))²

The linear structure of µ_i(θ) makes this LS minimisation equivalent to solving the following linear equations:

X^T y = X^T X θ, giving θ̂ = (X^T X)^(-1) X^T y

In principle LS estimation can also be used for non-normal models

In non-normal models LS estimation will usually be simpler than ML estimation, but it may be inefficient
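The normal equations translate directly into code. A minimal sketch with a hypothetical intercept-plus-slope design (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical data: one covariate plus an intercept column
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
X = np.column_stack([np.ones_like(x), x])

# Solve X^T X theta = X^T y directly rather than forming the inverse explicitly
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# theta_hat[0] is the intercept, theta_hat[1] the slope
```

Solving the linear system is preferred over computing (X^T X)^(-1) explicitly, as it is cheaper and numerically more stable.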

Minimum variance unbiased estimation

We have already discussed how to test whether a given unbiased estimator attains the minimum variance bound

An important result from theoretical statistics says that the minimum variance unbiased estimator (MVUE) is a function of the minimal sufficient statistic (with some additional theoretical conditions that are satisfied for the sorts of models considered in this unit)

This leads to an estimation method:

- Find the minimal sufficient statistic T(Y)
- Find some function H such that E[H(T(Y))] = θ
- Then H(T(Y)) is the MVUE

In practice this cannot always be done, and it is largely of theoretical interest only

In practice we usually use the MLE for parametric models
