A simple MLE algorithm using gradient ascent

This is a simple maximum-likelihood (ML) estimation algorithm for the normal distribution. I chose $\mathcal{N}(\mu, \sigma)$ because it is the most commonly known distribution. Of course, the ML estimators of $\mu$ and $\sigma$ are the sample mean and the sample variance (this can be proven analytically). However, I am interested in implementing a numerical approach.

The purpose of showing a numerical method for ML estimation is simple: I would have loved to see one when I was first studying mathematical statistics. Hopefully, some lost cyber-wanderer will find this example helpful in understanding how ML algorithms work.

Since the idea is to provide an algorithmic appreciation of the method, I will only briefly sketch the theory of MLE, which I assume the reader knows.

Maximum-likelihood estimation

Maximum-likelihood estimation consists in taking a vector $\textbf{x}$ of data as a fixed constant and the parameters of a distribution as variables. These parameters are denoted with the letter $\theta$. Observe that $\theta$ represents an $n$-tuple of parameters —not a unique one. In the case of the normal distribution, we have $\theta \mapsto (\mu, \sigma)$.

The idea is to observe the joint probability of $\textbf{x}$ as a function of $\theta$. Whatever $\hat{\theta}$ maximizes the function is by definition the ML estimator of the true parameters $\theta$.

If $f_X$ is the PDF of the identically distributed random variables in $\textbf{x}$, then the joint-probability (also called likelihood) of $\textbf{x}$ is

$$ \begin{align*} \mathcal{L}(\theta; \textbf{x}) = \prod_{j=1}^{n} f_X(x_j; θ) .\end{align*} $$

A key mathematical observation is that whatever $\hat{\theta}$ maximizes the function above is also the maximizer of $\ln \mathcal{L}$. In other words,

$$ \begin{align*} \underset{\theta}{\operatorname{\text{arg max }}} \mathcal{L} = \underset{\theta}{\operatorname{\text{arg max }}} \ln \mathcal{L} \end{align*} $$

We will let $\ln \mathcal{L} := \ell$. Now let us observe the case where $f_X$ is the PDF of the normal distribution.

The normal distribution case

We have the log-likelihood of $\theta$ given an observation $x$ to be

$$ \begin{align*} \ell(\theta; x) &= -\frac{1}{2\sigma^2}\left( x - \mu\right)^2 - \ln(\sigma) - \frac{1}{2}\ln(2\pi) \newline &=-\frac{1}{2\sigma^2}\left( x - \mu\right)^2 - \frac{1}{2}\ln(\sigma^2) - \frac{1}{2}\ln(2\pi) \newline \end{align*} $$

$$ \begin{align*} \ell (\theta; \textbf{x}) &= \sum_{j=1}^n \left(-\frac{1}{2\sigma^2}\left( x_j - \mu\right)^2 - \frac{1}{2}\ln(\sigma^2) - \frac{1}{2}\ln(2\pi)\right) \newline &= \left(-\frac{1}{2\sigma^2} - \frac{1}{2}\ln(\sigma^2) - \frac{1}{2}\ln(2\pi)\right)n + \sum_{j=1}^n (x_j- \mu)^2 \newline &= -\frac{n}{2}\ln 2\pi -\frac{n}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} \sum_{j = 1}^n (x_j - \mu)^2 .\end{align*} $$

This is the function that we need to maximize in terms of $\theta$. Since the first term is independent of the distribution parameters, minimization will pertain only to

$$ \begin{align*} f(\sigma, \mu; \vec{x}) = -\frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{j=1}^n(x_j - \mu)^2 .\end{align*} $$

To maximize it, we will use gradient ascent. This is, we will compute the gradient of $f$ and take infinitesimal steps towards the direction of steepest ascent (which is the gradient itself). As a convergence criteria, we will use the magnitude (or norm) of the gradient. The point of this criteria is that as we shift $\theta$ towards a maximum, the gradient itself becomes ever smaller in magnitude.

Now we must observe that

$$ \begin{align*} \frac{\partial f}{\partial \sigma} &= -n\frac{\partial }{\partial \sigma} \ln \sigma - \frac{1}{2}\sum_{j=1}^n (x_j - \mu)^2 \frac{\partial }{\partial \sigma} \sigma^{-2} \newline &= -\frac{n}{\sigma} - \frac{1}{2}\sum_{j=1}^n (x_j - \mu)^2 \left( -\frac{2}{\sigma^3} \right) \newline &= -\frac{n}{\sigma} + \frac{\Sigma}{\sigma^3} .\end{align*} $$

where $\Sigma$ denotes the $\mu$-dependent summation. At the same time (we skip the steps),

$$ \begin{align*} \frac{\partial f}{\partial \mu} &= \frac{1}{\sigma^2} \left( \sum_{j=1}^n x_j - n\mu\right) .\end{align*} $$

A Julia implementation

The previous section gave us everything we needed. We have an expression for the gradient of $\ell$ in terms of the distribution parameters, and we now only must implement a gradient ascent algorithm using it. A Julia implementation might go as follows:

using LinearAlgebra # To use the norm() function.

function N(x::Number, μ, σ)
    """Probability density of x ~ 𝓝(μ, σ)."""
    1 / sqrt(2pi) * exp(-0.5 * ((x - μ)/σ)^2)
end

function L(y::Vector, μ, σ)
    """The likelihood function"""
    fₓ(x) = N(x, μ, σ)
    prod(fₓ, y)
end

function pder_μ(μ::Number, σ::Number, x::Vector)
    """Partial derivative of L with respect to μ"""
    1/(σ^2) * (sum(x) - length(x) * μ)
end

function pder_σ(μ::Number, σ::Number, x::Vector)
    """Partial derivative of L with respect to σ"""
    -length(x)/σ + sum((x .- μ).^2)/σ^3
end

# The following two functions are to allow for syntactic sugar.

function pder_σ(θ::Vector, x::Vector)
    pder_σ(θ[1], θ[2], x)
end

function pder_μ(θ::Vector, x::Vector)
    pder_μ(θ[1], θ[2], x)
end

# The gradient ascent algorithm.

function grad_ascent(data, μ₀, σ₀, tol=0.01, lr=0.01)
    """Performs gradient ascent over the likelihood function given a vector of
    data. μ₀, σ₀ are initial parameters (can be random numbers)."""
    θ = [μ₀, σ₀] while true 
        ∇ℓ = [pder_μ(θ, data), pder_σ(θ, data)]
        θ = θ .+ ∇ℓ .* lr
        if norm(∇ℓ) < tol 
            break 
        end
    end
    return θ
end

# Example run

# Simulate data from a normal distribution with μ = 5, σ = 3.
data = 5 .+ 3 .* randn(100)

# Perform MLE. Initial parameters are arbitrarilly set to -10 and 3.
res = grad_ascent(data, -10, 3)

# The returned estimated parameters were 4.34 and 2.98.