A Gentle Introduction to Minimax Estimators ¶
You have probably heard about minimax estimators, Bayes estimators, and so on. This is an attempt to explain what they are and why we care, with a very simple real-world example.
Example: Tossing Coins ¶
Our example is really simple. Let's say you have tossed a coin 10 times, and got heads 4 times. What is the probability of getting a head? This might sound very straightforward. Of course it is 4/10 = 0.4! But in fact it is not that simple. It may be a fair coin ($p = 0.5$) and we just got 4 heads instead of 5 by random chance. What might be a good way to estimate the probability of getting a head? This is the question we will try to answer.
Risk of an Estimator ¶
Often, we talk about estimators that minimize the "mean squared error" or MSE. There are many reasons why this is useful (and also many reasons why it might not be). But it is one example of a Loss Function . Let $\theta$ be a parameter; in our example, it is the true probability of getting a head, which we do not know. We have seen some data (10 tosses, 4 heads) and we want to get a good estimate $\hat{\theta}$ of $\theta$.
What would be a good estimate? One way to get a good estimate is to minimize a loss, such as mean squared error. We define the loss function of the estimator: $$L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$$
If we do a good job, this number should be small. Hence, we define its mean value as the risk of the estimator, and try to minimize this risk. The risk of an estimator is: $$R(\theta, \hat{\theta}(X)) = \mathbb{E}_\theta L(\theta, \hat{\theta}(X))$$
It is important to unpack this. First, I have explicitly written $\hat{\theta}(X)$ to emphasize that this is a statistic, i.e. a function of the data. In the frequentist world, the data are random and the parameter is fixed. So, as we draw different random samples of data, we get different values of the statistic, and hence different losses. The expectation is with respect to the distribution of the data, i.e. an average over all the random samples we could draw. When our loss function is the squared error, the risk is simply our well-known MSE. $$R(\theta, \hat{\theta}(X)) = \mathbb{E}_\theta (\theta - \hat{\theta}(X))^2$$
You can consider other risk functions as well. For example, in real life a sometimes more useful risk function is the absolute loss, which is more robust to outliers. $$R(\theta, \hat{\theta}(X)) = \mathbb{E}_\theta |\theta - \hat{\theta}(X)|$$
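The risk definitions above are easy to approximate numerically. A minimal sketch (my own, not from the post; the function names are mine): fix a true $\theta$, simulate many datasets, apply the estimator to each, and average the loss over the draws.

```python
import numpy as np

# A minimal sketch (my own) of estimating the risk of an estimator by
# simulation: fix a true theta, draw many datasets, apply the estimator
# to each, and average the loss over the draws.
rng = np.random.default_rng(0)

def risk(estimator, theta, N=10, loss=lambda t, t_hat: (t - t_hat) ** 2,
         n_sims=100_000):
    X = rng.binomial(1, theta, size=(n_sims, N))  # each row: one dataset
    theta_hat = estimator(X)                      # one estimate per dataset
    return loss(theta, theta_hat).mean()          # Monte Carlo expectation

mle = lambda X: X.mean(axis=1)
print(risk(mle, 0.4))                                      # squared-error risk
print(risk(mle, 0.4, loss=lambda t, th: np.abs(t - th)))   # absolute-error risk
```

Any loss function can be plugged in; the rest of the post sticks to squared error.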
Two Estimators ¶
I will propose two estimators of $\theta$, i.e. the probability of getting a head. We will calculate the risk of each of them under squared loss.
MLE Estimator ¶
This is the one you probably guessed already. It is simply $$\hat{\theta}_1 = \frac{\sum X_i}{N}$$ where $X_i = 1$ when the coin falls heads, and $0$ when it falls tails. We got 4 heads, so $\sum X_i = 4$, $N = 10$, and $\hat{\theta}_1 = \frac{4}{10} = 0.4$.
What is the risk of this estimator? We will use some well-known results. First, the bias-variance decomposition tells us that MSE = bias squared + variance. $\hat{\theta}_1$ is the sample mean. It is an unbiased estimator of the population mean, and $$Var(\hat{\theta}_1) = \frac{\theta(1 - \theta)}{N}$$
Therefore: $$R(\theta, \hat{\theta}_1) = \frac{\theta(1 - \theta)}{N}$$
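A quick simulation can be used to sanity-check this closed form (my own check, not from the post): compare a Monte Carlo estimate of the MLE's MSE with $\theta(1-\theta)/N$ for a few values of $\theta$.

```python
import numpy as np

# Sanity check (my own): compare a Monte Carlo estimate of the MLE's MSE
# with the closed form theta * (1 - theta) / N.
rng = np.random.default_rng(1)
N, n_sims = 10, 200_000

results = {}
for theta in (0.1, 0.3, 0.5):
    X = rng.binomial(1, theta, size=(n_sims, N))   # n_sims datasets of N tosses
    theta_hat = X.mean(axis=1)                     # MLE for each dataset
    simulated = ((theta_hat - theta) ** 2).mean()
    analytic = theta * (1 - theta) / N
    results[theta] = (simulated, analytic)
    print(f"theta={theta}: simulated {simulated:.4f}, analytic {analytic:.4f}")
```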
Now, I want to point out a REALLY IMPORTANT THING. Notice that the risk of the estimator depends on the true value of the parameter, $\theta$ . Our goal was to find an estimator that has small risk. But it seems that the risk itself can be small or large, depending on the true value of the parameter. However, we don't know the true value (otherwise, why would we be trying to estimate?). Then how do we find a good estimator?
This is the challenge we are trying to address.
A Bayesian Approach ¶
I won't spend a lot of time explaining what Bayesian statistics is, etc. But a well-known Bayesian approach is to use a conjugate prior, which is a beta distribution in this case. This is our belief about what $\theta$ is before we toss the coin. Maybe we believe the coin is fair, because generally all coins coming from the mint are fair. If our prior is $Beta(\alpha, \beta)$, the posterior distribution is $Beta(\alpha + \sum X_i, \beta + N - \sum X_i)$, with mean: $$\hat{\theta}_2 = \frac{\sum X_i + \alpha}{\alpha + \beta + N}$$
A simple physical interpretation is in terms of imaginary tosses. Since 10 is a really small sample, we insert our subjective belief by adding imaginary tosses, heads and tails, to the data. Maybe we are very sure that the coin is fair: we insert $\alpha = 50$ heads and $\beta = 50$ tails, which are imaginary and reflect our prior belief. Then we calculate $\hat{\theta}_2 = \frac{4 + 50}{50 + 50 + 10} = \frac{54}{110} \approx 0.491$. You see how the estimate is pulled toward 50% rather than the 40% we calculated from the raw data?
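The pseudo-count arithmetic above is easy to play with in code (a small sketch of mine; the helper name is my own):

```python
# The imaginary-tosses arithmetic as a tiny helper (my own sketch):
# alpha and beta act as imaginary heads and tails added to the data.
def posterior_mean(heads, N, alpha, beta):
    return (heads + alpha) / (alpha + beta + N)

print(posterior_mean(4, 10, 50, 50))  # strong fair-coin prior, ~0.491
print(posterior_mean(4, 10, 1, 1))    # weak uniform prior, ~0.417
```

The stronger the prior (larger pseudo-counts), the more the estimate is pulled toward the prior mean.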
Now the question arises: which one is the better estimator? To answer that, we repeat the same exercise as before and find the MSE of this estimator. Again, MSE = bias squared + variance.
Therefore, $$MSE = \left(\mathbb{E}_\theta\left(\frac{\sum X_i + \alpha}{\alpha + \beta + N}\right) - \theta \right)^2 + Var_\theta \left(\frac{\sum X_i + \alpha}{\alpha + \beta + N}\right) \\ = \left( \frac{N\theta + \alpha}{\alpha + \beta + N} - \theta \right)^2 + \frac{1}{(\alpha + \beta + N)^2} Var_\theta (\sum X_i + \alpha) \\ = \left( \frac{N\theta + \alpha}{\alpha + \beta + N} - \theta \right)^2 + \frac{N \theta(1 - \theta)}{(\alpha + \beta + N)^2}$$
The choice of $\alpha$ and $\beta$ is subjective. For reasons that will be clear later, let's make a particular choice. We choose $\alpha = \beta = \sqrt{\frac{N}{4}}$. With this choice, $$\alpha + \beta + N = N + \sqrt{N}$$ and $$MSE = \frac{(\alpha - \alpha \theta - \beta \theta)^2 + N \theta(1 - \theta)}{(\alpha + \beta + N)^2} = \frac{\frac{N}{4}(1 - 2\theta)^2 + N \theta(1 - \theta)}{(N + \sqrt{N})^2} = \frac{N}{4(N + \sqrt{N})^2}$$
Notice that with this choice, the risk is now independent of the true parameter.
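A numerical check (mine, not from the post) that this choice of $\alpha$ and $\beta$ really flattens the risk: compute bias² + variance from the formulas above on a grid of $\theta$ values and compare with the closed form.

```python
import numpy as np

# Numerical check (mine) that with alpha = beta = sqrt(N)/2 the MSE,
# computed as bias^2 + variance from the formulas above, is the same
# for every theta and matches N / (4 * (N + sqrt(N))**2).
N = 10
alpha = beta = np.sqrt(N) / 2

def bayes_mse(theta):
    bias = (N * theta + alpha) / (alpha + beta + N) - theta
    var = N * theta * (1 - theta) / (alpha + beta + N) ** 2
    return bias ** 2 + var

thetas = np.linspace(0, 1, 11)
mses = np.array([bayes_mse(t) for t in thetas])
print(mses)                              # a flat line
print(N / (4 * (N + np.sqrt(N)) ** 2))   # the closed form
```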
Which one is Better? Can't say. ¶
So, the question is, which estimator is better? Of course, the one with lower risk. Let's plot the risks of the estimators as a function of the true parameter $\theta$.
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 100)
small_N = 10
mle_risk = theta * (1. - theta) / small_N
bayesian_risk = small_N / (4 * (small_N + np.sqrt(small_N))**2) * np.ones(100)
plt.plot(theta, mle_risk, 'b-', label='MLE Estimator')
plt.plot(theta, bayesian_risk, 'r-', label='Bayesian Estimator')
plt.xlabel(r'$\theta$')
plt.ylabel('Risk')
plt.legend()
plt.show()
This is where things get confusing. Which estimator is better? It depends . It depends on the true value of the parameter. If the true value of $\theta$ is 0.1, the MLE estimator is better. If $\theta$ is 0.5 (i.e. a fair coin), then the Bayesian estimator is better. But we have no idea what the true value of $\theta$ is. If we already knew whether the coin is fair, or had a 10% chance of falling heads, we wouldn't be trying to estimate it. Therefore, our decision on which estimator is good depends on the very quantity we are trying to estimate. This is the basic issue that we are trying to solve with minimax estimators .
Minimax Estimators ¶
One solution to this problem is the minimax estimator. Since the risk is a function of the true value, we need to reduce it to one single number. The minimax approach tries to find an estimator that minimizes the maximum risk. $$R_n = \inf_{\hat{\theta}} \sup_{\theta} R(\theta, \hat{\theta})$$
What does this mean? For each estimator (for example, the two we looked at, $\hat{\theta}_1$ and $\hat{\theta}_2$), we find its maximum risk over all possible values of $\theta$. In our example, we have
$$\text{max-risk}(\hat{\theta}_1) = \sup_\theta \frac{\theta(1 - \theta)}{N} = \frac{1}{4N}$$ and $$\text{max-risk}(\hat{\theta}_2) = \frac{N}{4(N + \sqrt{N})^2}$$
Now, the minimax estimator is the one for which the max-risk is the minimum . Here we compared these two estimators, and $\hat{\theta}_2$ has the lower maximum risk. Is it therefore the minimax estimator? For that to be true, it must have the minimum maximum risk among all possible estimators. That seems like a daunting task, and it is.
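A quick numerical comparison of the two maximum risks (my own check, not from the post) shows the Bayesian estimator's worst case is indeed smaller, for several sample sizes:

```python
import numpy as np

# Check (mine) that the constant-risk Bayes estimator's maximum risk
# stays below the MLE's for several sample sizes.
rows = []
for N in (10, 100, 1000):
    mle_max = 1 / (4 * N)                        # worst case of theta(1-theta)/N
    bayes_max = N / (4 * (N + np.sqrt(N)) ** 2)  # constant risk, so also its max
    rows.append((N, mle_max, bayes_max))
    print(f"N={N}: MLE max risk {mle_max:.5f}, Bayes max risk {bayes_max:.5f}")
```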
We have merely defined what minimax estimators are. Finding one is much harder, and beyond our scope.
So, did we find the minimax estimator? ¶
Although in general it is hard to find the minimax estimator, in this case we did. To prove this, we need to develop a little more mathematical machinery. First, we will define the Bayes risk. In the Bayesian paradigm, the parameter is not fixed, but has a distribution. Minimax may seem too rigid, optimizing for the worst case. What if instead we took an expectation of the risk over the distribution of the parameter? That seems like a more reasonable approach: on average we will end up choosing the better estimator. This is the approach of the Bayes estimator. If $\pi(\theta)$ is the prior distribution of the parameter, then we define the Bayes risk as $$B_\pi(\hat{\theta}) = \int R(\theta, \hat{\theta}) \pi(\theta) d\theta $$ The estimator that minimizes the Bayes risk is called the Bayes estimator.
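The Bayes risk of our two estimators can be approximated numerically. A sketch of mine, assuming a uniform prior $\pi(\theta) = 1$ on $[0, 1]$ (my choice of prior, not the post's): average each risk function over a fine grid of $\theta$ values.

```python
import numpy as np

# Sketch (mine): approximate the Bayes risk under a uniform prior on [0, 1]
# by averaging each estimator's risk function over a fine theta grid.
N = 10
alpha = beta = np.sqrt(N) / 2
grid = np.linspace(0, 1, 10_001)

mle_risk_curve = grid * (1 - grid) / N
bias = (N * grid + alpha) / (alpha + beta + N) - grid
bayes_risk_curve = bias**2 + N * grid * (1 - grid) / (alpha + beta + N) ** 2

print(mle_risk_curve.mean())    # Bayes risk of the MLE under this prior
print(bayes_risk_curve.mean())  # Bayes risk of the constant-risk estimator
```

The second value is just the constant risk itself, since averaging a constant function over any prior gives the constant back, which is exactly the fact used in the proof below.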
However, we do not wish to go into the topic of Bayes estimators in this post. We invoked the idea in order to prove an important result. The result is simply that Bayes estimators with a constant risk function are minimax estimators . Remember we set $\alpha$ and $\beta$ cleverly, in order to make the risk independent of the parameter! This is the reason we did that.
One result we will omit, for the sake of brevity, is that the Bayes estimator under squared loss is the posterior mean. Thus we have assembled all the machinery. $\hat{\theta}_2$ is a Bayes estimator, since it is the posterior mean, and it also has a constant risk. Hence, it is minimax.
Now, let's prove the last statement. We first claim that if, for a given prior $\pi$, $R(\theta, \hat{\theta}) \leq B_\pi(\hat{\theta})$ for all $\theta$, and $\hat{\theta}$ is the Bayes estimator for $\pi$, then $\hat{\theta}$ is minimax. The proof is by contradiction. Assume that $\hat{\theta}$ is not minimax, so there exists another estimator $\theta_0$ with strictly smaller maximum risk: $\sup_\theta R(\theta, \theta_0) < \sup_\theta R(\theta, \hat{\theta})$. Since the average of a function is at most its supremum, $B_\pi(\theta_0) \leq \sup_\theta R(\theta, \theta_0)$. Therefore, $$B_\pi(\theta_0) \leq \sup_\theta R(\theta, \theta_0) < \sup_\theta R(\theta, \hat{\theta}) \leq B_\pi(\hat{\theta})$$ But this is a contradiction: $\hat{\theta}$ is the Bayes estimator, i.e. the one that minimizes the Bayes risk, so it is not possible that $B_\pi(\theta_0) < B_\pi(\hat{\theta})$. Hence $\hat{\theta}$ must be minimax. If the risk $R(\theta, \hat{\theta})$ is constant in $\theta$, the condition is trivially satisfied, since $B_\pi(\hat{\theta}) = \int R(\theta, \hat{\theta}) \pi(\theta) d\theta = R(\theta, \hat{\theta})$. Hence, we have shown that a Bayes estimator with constant risk is a minimax estimator, and since $\hat{\theta}_2$ satisfies these conditions, we can claim it is minimax.