A Deeper Look at Mean Squared Error

In this post we're going to take a deeper look at Mean Squared Error. Despite the relatively simple nature of this metric, it contains a surprising amount of insight into modeling. By breaking down Mean Squared Error into bias and variance, we'll get a better sense of how models work and ways in which they can fail to perform. We'll also find that baked into the definition of this metric is the idea that there is always some irreducible uncertainty that we can never quite get rid of.

A quick explanation of bias

The primary mathematical tools we'll be using are expectation, variance and bias. We've talked about expectation and variance quite a bit on this blog, but we need to introduce the idea of bias. Bias is used when we have an estimator, \(\hat{y}\), that is used to predict another value \(y\). For example, you might want to predict the rainfall, in inches, for your city in a given year. The bias is the average difference between that estimator and the true value. Mathematically we just write bias as:

$$\text{bias} = E[\hat{y} - y]$$

Note that unlike variance, bias can be either positive or negative. If your rainfall predictions for your city had a bias of 0, it means that you're just as likely to predict too much rain as to predict too little. If your bias was positive it means that you tend to predict more rain than actually occurs. Likewise, negative bias means that you underpredict. However, bias alone says little about how correct you are. Your forecasts could always be wildly wrong, but if you predict a drought during the rainiest year and predict floods during the driest then your bias can still be 0.
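To make this concrete, here is a tiny sketch (in Python, with made-up rainfall numbers) showing that a forecaster can be wildly wrong every year and still have zero bias:

```python
# Hypothetical rainfall data (inches per year) -- made-up numbers for illustration.
actual = [40, 20, 30]

# Forecaster A is close every year, and slightly high on average.
forecast_a = [41, 21, 31]

# Forecaster B is wildly wrong every year, but the errors cancel out.
forecast_b = [20, 40, 30]

def bias(predicted, observed):
    """Average difference between the estimator and the true values."""
    return sum(p - o for p, o in zip(predicted, observed)) / len(observed)

print(bias(forecast_a, actual))  # 1.0  (tends to overpredict slightly)
print(bias(forecast_b, actual))  # 0.0  (zero bias, yet far from accurate)
```

Forecaster B's errors of -20, +20 and 0 average out to exactly 0, even though each individual forecast is terrible.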

What is Modeling?

Before we dive too deeply into our problem, we want to be very clear about what we're even trying to do whenever we model data. Even if you're an expert in machine learning or scientific modeling, it's always worth it to take a moment to make sure everything is clear. Whether we're just making a random guess, building a deep neural network, or formulating a mathematical model of the physical universe, anytime we're modeling we are attempting to understand some process in nature. Mathematically we describe this process as an unknown (and perhaps unknowable) function \(f\). This could be the process of curing meat, the complexity of a game of basketball, the rotation of the planets, etc.

Describing the world

To understand this process we have some information about how the world works, and we know what outcome we expect from the process. The information about the world is our data, and we use the variable \(x\) to denote it. It is easy to imagine \(x\) as just a single value, but it could be a vector of values, a matrix of values or something even more complex. And we know that this information about the world produces a result \(y\). Again we can think of \(y\) as a single value, but it can also be something more complex. This gives us our basic understanding of things, and we have the following equation:

$$y = f(x)$$

So \(x\) could be the height an object is dropped from, \(f\) could be the effects of gravity and the atmosphere on the object, and \(y\) the time it takes to hit the ground. However, one very important thing is missing from this description of how things work. Clearly if you drop an object and time it, you will get slightly different observations each time, especially if you drop different objects. We consider this the "noise" in the process and we denote it with the variable \(\epsilon\) (lowercase "epsilon" for those unfamiliar). Now we have this equation:

$$y = f(x) + \epsilon$$

The \(\epsilon \) value is considered to be normally distributed with some standard deviation \(\sigma\) and a mean of 0. We'll note this as \(\mathcal{N}(\mu=0,\sigma)\). This means that a negative and a positive impact from this noise are considered equally likely, and that small errors are much more likely than extreme ones.
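A quick simulation (a Python sketch, with an arbitrary \(\sigma = 3\)) illustrates both properties of this noise: the samples average out to 0, and small errors are far more common than extreme ones:

```python
import random

random.seed(0)  # for reproducibility

sigma = 3  # an arbitrary standard deviation for this sketch
eps = [random.gauss(0, sigma) for _ in range(100_000)]

# The noise averages out to (roughly) 0: over- and under-shoots are equally likely.
mean_eps = sum(eps) / len(eps)
print(abs(mean_eps) < 0.1)  # True

# Small errors dominate: far more samples fall within 1 sigma than beyond 2 sigma.
within_one = sum(abs(e) <= sigma for e in eps)
beyond_two = sum(abs(e) > 2 * sigma for e in eps)
print(within_one > beyond_two)  # True (roughly 68% vs. roughly 5%)
```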

Our model

This equation is just for how the world actually works: given \(x\) that we know and \(y\) that we observe. However, we don't really know how the world works, so we have to make models. A model, in its most general form, is simply some other function \(\hat{f}\) that takes \(x\) and should approximate \(y\). Mathematically we might say:

$$\hat{f}(x) \approx y$$

However it's useful to expand \(y\) out into \(f(x) + \epsilon\). This helps us see that \(\hat{f}\) is not approximating \(f\), our natural process, directly.

$$\hat{f}(x_i) \approx f(x_i) + \epsilon_i $$

Instead we have to remember that \(\hat{f}\) models \(y\) which includes our noise. In practical modeling and machine learning this is an important thing to remember. If our noise term is very large, it can be difficult or even impossible to meaningfully capture the behavior of \(f\) itself.

Measuring the success of our model with Mean Squared Error

Once we have implemented a model we need to check how well it performs. In machine learning we typically split our data into training and testing sets, but you can also imagine a scientist building a model on past experiments or an economist making a model based on economic theories. In either case, we want to test the model and measure its success. When we do this we have a set of data points \(x\), each indexed so that we label them \(x_i\). To evaluate the model we need to compare each prediction to the corresponding \(y_i\). The most common way to do this for real-valued data is to use Mean Squared Error (MSE). MSE is exactly what it sounds like:

- We take the error (i.e. the difference between \(\hat{f}(x_i)\) and \(y_i\))

- Square this value (making negative and positive errors count the same, and giving greater errors a more severe penalty)

- Then we take the average (mean) of these results.

Mathematically we express this as:

$$MSE = \frac{1}{n}\sum_{i=1}^n (y_i -\hat{f}(x_i))^2 $$

Of course the mean of our squared error is also the same thing as the expectation of our squared error, so we can go ahead and simplify this a bit:

$$MSE = E[(y - \hat{f}(x))^2]$$

It's worth noting that if our model simply predicted the mean of \(y\), \(\mu_y\), for every answer, then our MSE would be the same as the variance of \(y\), since one way we can define variance is as \(E[(y-\mu)^2]\).
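Here is a sketch of both the metric and that note (in Python, with toy values): computing MSE for a "model" that always predicts \(\mu_y\) gives exactly the variance of \(y\):

```python
# Toy y values, just for illustration.
y = [2.0, 4.0, 6.0, 8.0]

def mse(y_true, y_pred):
    """Mean of the squared differences between observations and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

mu_y = sum(y) / len(y)          # 5.0
mean_model = [mu_y] * len(y)    # always predict the mean

variance_y = sum((t - mu_y) ** 2 for t in y) / len(y)  # E[(y - mu)^2]
print(mse(y, mean_model) == variance_y)  # True
```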

Unpacking Mean Squared Error

Already MSE has some useful properties that are similar to the general properties we explored when we discussed why we square values for variance. However, even more interesting is that inside this relatively simple equation are hidden the calculations of bias and variance for our model! To find these we need to start by expanding our MSE a bit more. We'll start by replacing \(y\) with \(f(x) + \epsilon\):

$$MSE = E[(f(x)+ \epsilon - \hat{f}(x))^2]$$

Before moving on there is one rule of expectation we need to know so that we can manipulate this a bit better:

$$E[X+Y] = E[X] + E[Y]$$

That is, adding two random variables and computing the expectation of the sum is the same as computing the expectation of each random variable and then adding those together. Of course, we can't use this on our equation yet because all of the added terms are inside a square. To do anything more interesting we need to expand this out fully:

$$MSE = E[(f(x)+ \epsilon - \hat{f}(x))^2] =E[(f(x)+ \epsilon - \hat{f}(x))\cdot (f(x)+ \epsilon - \hat{f}(x))] = \ldots $$

$$E[-2f(x)\hat{f}(x) + f(x)^2 + 2\epsilon f(x) + \hat{f}(x)^2 - 2\epsilon \hat{f}(x) + \epsilon^2]$$

Now this looks like a mess, but let's start pulling it apart in terms of the expectation of each term we're adding or subtracting.

At the end we have \(\epsilon^2\). We know that this is our noise, which we already stated is normally distributed with a mean \(\mu=0\) and an unknown standard deviation \(\sigma\). Recall that \(\sigma^2 = \text{Variance}\). The definition of variance is:

$$E[X^2] - E[X]^2$$

So the variance for our \(\epsilon\) is:

$$Var(\epsilon) = \sigma^2 = E[\epsilon^2] - E[\epsilon]^2$$

But we can do a neat trick here! We just said that \(\epsilon \) is sampled from a Normal distribution with mean of 0, and the mean is the exact same thing as the expectation so:

$$E[\epsilon]^2 = \mu^2 = 0^2 = 0$$

And because of this we know that:

$$E[\epsilon^2] = \sigma^2 $$
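We can sanity-check this identity with a quick simulation (a Python sketch, using \(\sigma = 3\) so that \(\sigma^2 = 9\)):

```python
import random

random.seed(1)

sigma = 3  # an assumed noise level for this sketch
eps = [random.gauss(0, sigma) for _ in range(200_000)]

# E[eps] is approximately 0 ...
e_eps = sum(eps) / len(eps)

# ... so E[eps^2] - E[eps]^2 collapses to E[eps^2], which estimates sigma^2 = 9.
e_eps_sq = sum(e * e for e in eps) / len(eps)
print(abs(e_eps_sq - sigma ** 2) < 0.5)  # True
```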

We can use this to pull out the \(\epsilon^2\) term from our MSE definition:

$$MSE = \sigma^2 + E[-2f(x)\hat{f}(x) + f(x)^2 + 2\epsilon f(x) + \hat{f}(x)^2 - 2\epsilon \hat{f}(x)]$$

Which still looks pretty gross! However, if we pull out all the remaining terms with \(\epsilon \) in them we can clean this up very easily:

$$MSE = \sigma^2 + E[2\epsilon f(x)- 2\epsilon \hat{f}(x)] + E[-2f(x)\hat{f}(x) + f(x)^2 + \hat{f}(x)^2]$$

Again, we know that the expected value of \(\epsilon\) is 0, so all the terms in \(E[2\epsilon f(x)- 2\epsilon \hat{f}(x)]\) will end up as 0 in the long run! Now we've cleaned things up quite a bit:

$$MSE = \sigma^2 + E[-2f(x)\hat{f}(x)+f(x)^2 + \hat{f}(x)^2]$$

And here we see that one of the terms in our MSE reduced to \(\sigma^2\), or the variance in the noise in our process.

A note on noise variance

It is worth stressing how important this insight is. The variance of the noise in our data is an irreducible part of our MSE. No matter how clever our model is, we can never reduce our MSE below the variance of the noise. It's also worthwhile thinking about what "noise" is. In orthodox statistics we often think of noise as just some randomness added to the data, but as a Bayesian I can't quite swallow this view. Randomness isn't a statement about data, it's a statement about our state of knowledge. A better interpretation is that noise represents the unknown in our view of the world. Part of it may be simply measurement error: any equipment measuring data has limitations. But, and this is very important, it can also account for the limitations of our \(x\) in fully capturing all the information needed to adequately explain \(y\). If you are trying to predict the temperature in NYC from just the atmospheric CO2 levels, there's still far too much unknown in your model.

Model variance and bias

So now that we've extracted out the noise variance, we still have a bit of work to do. Luckily we can simplify the inner term of the remaining expectation quite a bit:

$$-2f(x)\hat{f}(x)+f(x)^2 + \hat{f}(x)^2 = (f(x)-\hat{f}(x))^2 $$

With that simplification we're now left with:

$$MSE = \sigma^2 + E[(f(x)-\hat{f}(x))^2]$$

Now we need to do one more transformation that is really exciting! We just defined

$$Var(X) = E[X^2] - E[X]^2$$

Which means that:

$$Var(f(x)-\hat{f}(x)) = E[(f(x)-\hat{f}(x))^2] - E[f(x)-\hat{f}(x)]^2$$

A simple reordering of these terms solves this for our \(E[(f(x)-\hat{f}(x))^2]\):

$$E[(f(x)-\hat{f}(x))^2]=Var(f(x)-\hat{f}(x)) + E[f(x)-\hat{f}(x)]^2$$

And now we get a much more interesting formulation of MSE:

$$MSE = \sigma^2 + Var(f(x)-\hat{f}(x)) + E[f(x) - \hat{f}(x)]^2$$

And here we see two remarkable terms just pop out! The first, \(Var(f(x)-\hat{f}(x))\), is, quite literally, the variance of our predictions from the true output of our process. Notice that even though we can't separate \(f(x)\) and \(\epsilon\), and our model cannot either, baked into MSE is the true variance between the noise-free process \(f\) and \(\hat{f}\). The second term that pops out, \(E[f(x) - \hat{f}(x)]^2\), is just the bias squared. Remember that unlike variance, bias can be positive or negative (depending on how the model is biased), so we need to square this value to make sure it's always positive.

With MSE unpacked we can see that Mean Squared Error is quite literally:

$$\text{Mean Squared Error}=\text{Model Variance} + \text{Model Bias}^2 + \text{Irreducible Uncertainty}$$
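We can check this decomposition numerically. The sketch below (in Python) assumes a toy process \(f(x) = 3x + 6\) with \(\sigma = 3\), and a deliberately biased model whose predictions are always 2 too low:

```python
import random

random.seed(2)

def f(x):
    """An assumed true process for this sketch: f(x) = 3x + 6."""
    return 3 * x + 6

def f_hat(x):
    """A deliberately biased model: same slope, y-intercept 2 too low."""
    return 3 * x + 4

xs = [i * 0.05 for i in range(201)]            # x from 0 to 10
ys = [f(x) + random.gauss(0, 3) for x in xs]   # y = f(x) + eps, with sigma = 3

preds = [f_hat(x) for x in xs]
mse = sum((y_i - p) ** 2 for y_i, p in zip(ys, preds)) / len(xs)

# Against the noise-free f we can compute model bias and variance directly.
diffs = [f_hat(x) - f(x) for x in xs]                    # always -2
bias = sum(diffs) / len(diffs)                           # -2: consistent underestimate
var = sum((d - bias) ** 2 for d in diffs) / len(diffs)   # 0: the error never varies

# MSE should land near bias^2 + variance + sigma^2 = 4 + 0 + 9 = 13.
print(abs(bias + 2) < 1e-9, var < 1e-9, abs(mse - 13) < 4)
```

The observed MSE wobbles around 13 from run to run because it is estimated from noisy samples, but the bias and variance terms are exact here since we know the true \(f\).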

Simulating Bias and Variance

It turns out that we actually need MSE to capture the details of bias, variance and the uncertainty in our model because *we cannot directly observe these properties ourselves*. In practice there is no way to know explicitly what \(\sigma^2\) is for our \(\epsilon\). This means that we cannot determine exactly what the bias and variance are for our model. But since \(\sigma^2\) is constant, we do know that if we lower our MSE we must have lowered at least one of these other properties of our model.

However, to understand these concepts better we can simulate a function \(f\) and apply our own noise sampled from a distribution with a known variance. Here is a bit of R code that will create a \(y\) value from the function \(f(x) = 3\cdot x + 6\) with \(\epsilon \sim \mathcal{N}(\mu=0,\sigma=3)\):

f <- function(x){
  3*x + 6
}

x <- seq(0, 10, by = 0.05)
e <- rnorm(length(x), mean = 0, sd = 3)
y <- f(x) + e

We can visualize our data and see that we get what we would expect, points scattered around a straight line:


Because we know the true \(f\) and \(\sigma\) we can experiment with different \(\hat{f}\) models and see their bias and variance. Let's start with a model that is the same as \(f\) except that it has a different y-intercept:

f.hat.1 <- function(x){
  # same slope as the true f, but with a y-intercept 2 lower
  3*x + 4
}

We can plot this out and see how it looks compared to our data.

Our high bias model is consistently off in its estimate.

As we can see, the blue line representing our model follows our data pretty closely, but it systematically underestimates our values. We can go ahead and calculate the MSE for this model and, because we know the true values, we can also calculate the bias and variance:

  • MSE: 13

  • bias: -2

  • variance: 0

So \(\hat{f}_1\) has a negative bias because it underestimates the points more often than it overestimates them. But, compared to the true \(f\), it does so consistently, and therefore has 0 variance with respect to \(f\). Given that we have an MSE of 13, a bias of -2 and 0 variance, we can also calculate that the \(\sigma^2\) for our uncertainty is:

$$\sigma^2 = \text{MSE} - (\text{bias}^2 + \text{variance}) = 13 - ((-2)^2 + 0) = 9$$

Which is exactly what we would expect, given that we set our \(\sigma\) parameter to 3 when sampling our `e` values.

We can come up with another model, \(\hat{f}_2\), which simply guesses the mean of our data, which is 21.

f.hat.2 <- function(x){
  # always predict the mean of y, which is 21
  rep(21, length(x))
}

With this simple model we get just a straight line cutting our data in half:

When we use the mean to predict the data, nearly all of our error is caused by the variance in our model.

When we look at the MSE, bias and variance for this model we get very different results from last time:

  • MSE: 84

  • bias: 0

  • variance: 75

Given that our new model predicts the mean, it shouldn't be surprising to see that this model has no bias, since it underestimates \(f(x)\) just as much as it overestimates it. The tradeoff here is that the variance in how the predictions differ between \(\hat{f_2}\) and \(f\) is very high. And of course when we subtract the variance out of our MSE we still have the same \(\sigma^2=9\).
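As a quick check on where that variance of roughly 75 comes from, here is a Python sketch of the same grid of \(x\) values: it is just the variance of the true \(f(x)\) around its mean across the grid.

```python
# The grid of x values from the simulation, and the true f(x) = 3x + 6.
xs = [i * 0.05 for i in range(201)]   # 0 to 10
fx = [3 * x + 6 for x in xs]

mu = sum(fx) / len(fx)  # mean of f(x) over the grid: 21
var = sum((v - mu) ** 2 for v in fx) / len(fx)
print(abs(mu - 21) < 1e-6, 74 < var < 77)  # variance close to the reported 75
```

(The discrete grid gives a value slightly above 75; the exact figure depends on how the variance is estimated.)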

For our final model, \(\hat{f}_3\), we'll just use R's built-in `lm` function to build a linear model:

f.hat.3 <- lm(y ~ x)

When we plot this out we can see that, by minimizing the MSE, the linear model was able to recover our original \(f\):

When we use the actual model that generates the data we get 0 variance and 0 bias, but still have error due to uncertainty.

Since our true function \(f\) is a linear model, it's not surprising that \(\hat{f_3}\) is able to learn it. By this point you should be able to guess the results for this model:

  • MSE: 9

  • bias: 0

  • variance: 0

The MSE remains 9 because, even when we're cheating and know the true model, we can't do better than the uncertainty in our data. So \(\sigma^2\) stands as the limit on how well our model can perform.

The Bias-Variance tradeoff

As we can see, if we want to improve our model we have to decrease either bias or variance. In practice bias and variance come up when we see how our models perform on a separate set of data than what they were trained on. When testing your model on data it wasn't trained on, bias and variance can be used to diagnose different problems with your model. High bias on test data typically means your model failed to find the \(f\) in the data, so it systematically under- or over-predicts. This is called "underfitting" in machine learning, because you have failed to fit your model to the data.

High variance when working with test data indicates a different problem. High variance means that the predictions your model makes vary greatly from the known results in the test data. This indicates your model has learned the training data so well that it has started to mistake \(f(x) + \epsilon\) for just the true \(f(x)\) you wanted to learn. This problem is called "overfitting", because you are fitting your model too closely to the data.


When we unpack and really understand Mean Squared Error we see that it is much more than just a useful error metric; it is also a guide for reasoning about models. MSE allows us, in a single measurement, to capture the ideas of bias and variance in our models, while showing that there is some uncertainty in our models that we can never get rid of. The real magic of MSE is that it provides indirect mathematical insight into the behavior of the true process, \(f\), which we are never able to truly observe.

If you enjoyed this post please subscribe to keep up to date and follow @willkurt!

If you enjoyed this writing and also like programming languages, you might enjoy my book Get Programming With Haskell from Manning. My next book, Bayesian Statistics the Fun Way, is expected in July 2019 from No Starch Press.