Many introductory treatments of Probability and Statistics focus very heavily on the "how" (e.g. how do I compute Standard Deviation), but leave many of the "why" questions untouched. This is often because truly unraveling the "why" is considered too difficult and best left for advanced study. One of my ambitions with this blog is to present many of the insights gained from a rigorous, measure-theoretic study of probability in a way that is easily understandable. I believe that in an age of ubiquitous computing the "why" is fundamentally more important than the "how".

I want to briefly revisit Expectation and Variance and give a quick overview of the way these topics evolve from their often algorithmic initial treatment to their more rigorous mathematical definitions. This overview should fill in any remaining gaps for anyone whose previous exposure to these topics doesn't quite jibe with the earlier posts here.

Average, Mean and Expectation

The idea of the "average" of a set of data points is most people's first glimpse into the world of Probability. When we are taught to "average" something we are usually taught how to calculate the Arithmetic Mean of a set of data points. The plain English expression of this is to "sum up all the values and divide them by the total number of values". It is usually presented something like this:

$$\text{mean} = \frac{{x_{1} + x_{2} + \ldots + x_{n}}}{n}$$

Our first exposure typically doesn't include the $x_i$ notation, preferring to use the actual values. Next we inevitably add some Greek letters to make this equation a little neater. $\mu$ (mu) is the standard symbol for mean, and we start using $\Sigma$ (capital Sigma) to make the summation notation more compact.

$$\mu = \frac{1}{n}\sum_{i=1}^{n}x_i$$

It is worth noting that the difference between these two formulas is purely notational. No new concepts have been introduced at this point.
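As a minimal sketch, here is the same calculation in Python. The data values are made up for illustration, and `arithmetic_mean` is just a name chosen here; the standard library's `statistics.mean` does the same job.

```python
from statistics import mean

# Hypothetical data points, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

def arithmetic_mean(xs):
    """Sum up all the values and divide by the number of values."""
    return sum(xs) / len(xs)

mu = arithmetic_mean(data)
print(mu)          # 5.0
print(mean(data))  # the standard library agrees
```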

The next step is to add the idea of the probability of an event. Often at this stage we are conflating events and Random Variables (which remains a useful shortcut in real world probability problems). Rather than simply having a list of $x$s, each $x_i$ now represents a unique value, and we associate it with a $p_i$, which is the probability of that value occurring. If every value is equally likely then $p_i = \frac{1}{n}$ and we recover the previous formula. Now we have:

$$\mu = \sum_{i=1}^{n} x_i p_i$$
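Here is a small sketch of the probability-weighted version. The fair die and the sample data are illustrative assumptions, and `expectation` is a helper name invented for this example. Exact fractions are used so the results are not blurred by floating point rounding.

```python
from collections import Counter
from fractions import Fraction

def expectation(values, probs):
    """Probability-weighted sum: each value times its probability."""
    return sum(x * p for x, p in zip(values, probs))

# A fair six-sided die: six values, each with probability 1/6.
die_values = [1, 2, 3, 4, 5, 6]
die_probs = [Fraction(1, 6)] * 6
die_mean = expectation(die_values, die_probs)
print(die_mean)  # 7/2

# Hypothetical data: taking p_i to be the relative frequency of each
# unique value recovers the ordinary arithmetic mean.
data = [2, 4, 4, 4, 5, 5, 7, 9]
counts = Counter(data)
unique_values = list(counts)
freqs = [Fraction(counts[x], len(data)) for x in unique_values]
data_mean = expectation(unique_values, freqs)
print(data_mean)  # 5
```

The second half shows why no new machinery was needed: the arithmetic mean is the special case where the probabilities are the observed frequencies.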

The big step here is that we're no longer exclusively thinking of the mean as an empirical measurement of our data. Instead, we are describing a theoretical property of a Probability Distribution. The next development is to make this all work for continuous values. Now we need calculus and a Probability Density Function (PDF). Rather than a discrete $p_i$ we look at $f(x)$, where $f$ is our PDF. The mean is now:

$$\mu = \int_{-\infty }^{\infty } xf(x)dx$$
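To make the integral concrete, here is a rough numerical sketch using only the standard library. The choice of an exponential distribution with rate 2 (whose mean is known to be 1/2), the truncation of the infinite range to [0, 40], and the simple midpoint rule are all assumptions made for this example, not a recommended way to integrate in practice.

```python
import math

def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

rate = 2.0  # assumed rate parameter for this example

def pdf(x):
    """Density of an exponential distribution with the given rate (x >= 0)."""
    return rate * math.exp(-rate * x)

# The density is zero below 0 and negligible beyond 40, so the
# infinite range is truncated to [0, 40].
total_mass = integrate(pdf, 0.0, 40.0)
mu = integrate(lambda x: x * pdf(x), 0.0, 40.0)
print(round(total_mass, 6))  # 1.0
print(round(mu, 6))          # 0.5
```

Under the hood this is still "value times weight, summed up"; the weights are now $f(x)$ times a tiny step width.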

Here many people stop learning about probability. This is sort of an annoying place to stop. At this point we are stuck with this muddled idea of a random variable (i.e. a variable that behaves randomly) that, by this level of mathematical progress, should seem pretty confusing. The randomly behaving variable is okay for students who don't have much experience with math, but variables should not be 'random'. What started as a teaching aid has become something that's both magical and confusing.

Additionally we have the problem that we need two separate models for Discrete and Continuous Probability Distributions. An even bigger issue is that we never talked about a third type of probability distribution that involves both discrete and continuous components!

Fortunately, the development of rigorous probability with measure theory answers all of these issues! We introduce the formalized idea of a Random Variable (a function $X$ that assigns a number to each element $\omega$ of a sample space), unify discrete and continuous probabilities as a probability measure $P$ on a sample space $\Omega$ (Omega), and use the Lebesgue Integral to sum up over the sample space. Formally, $\Omega$ is the set of possible outcomes, and events are subsets of it. The Lebesgue Integral can be understood simply as a more robust generalization of the Integral covered in basic calculus. Our $\mu$ is finally $E[X]$, and our generalized form of expectation is:

$$E[X] = \int_{\Omega} X(\omega)P(d\omega)$$
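To see how the general form contains the earlier ones, here is a sketch of how it specializes. The notation $P_X$ for the distribution of $X$ on the real line, and the particular mixed distribution at the end, are introduced here purely for illustration. Integrating over the values of $X$ rather than over $\Omega$ gives:

$$E[X] = \int_{\Omega} X(\omega)P(d\omega) = \int_{\mathbb{R}} x\,P_X(dx)$$

If $X$ takes only the values $x_i$ with probabilities $p_i$, then $P_X$ puts mass $p_i$ on each $x_i$ and the integral collapses to $\sum_{i} x_i p_i$. If $X$ has a PDF $f$, then $P_X(dx) = f(x)dx$ and we get $\int_{-\infty}^{\infty} xf(x)dx$. A mixed case works the same way. Suppose $X$ is exactly $0$ with probability $0.3$ and otherwise uniform on $[0, 1]$:

$$E[X] = 0.3 \cdot 0 + 0.7\int_{0}^{1} x\,dx = 0.35$$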

This formal definition may seem confusing, but if you followed the post on Expectation your understanding is actually closer to this than any of the other examples.

Standard Deviation and Variance

Many people's first encounter with Variance is as an annoying step between them and the Standard Deviation. This makes some degree of sense as the Normal Distribution is ubiquitous in Probability and Statistics, and the Standard Deviation is a very intuitive measure in this context. For a Normal Distribution, Standard Deviation measures the width of the distribution and makes it trivial to estimate confidence intervals given a mean.

The prime focus for most introductory instruction regarding the Standard Deviation is how to compute it:

Determine the mean of the data.
Subtract the mean from all the values in the data.
Square these differences.
Take the mean of these squares.
Finally take the square root of this new mean.
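The five steps above can be sketched directly in Python. The data is made up for illustration and `standard_deviation` is a name chosen here; the comparison uses `statistics.pstdev` from the standard library.

```python
import math
from statistics import pstdev

# Hypothetical data points, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

def standard_deviation(xs):
    mu = sum(xs) / len(xs)                  # 1. determine the mean
    diffs = [x - mu for x in xs]            # 2. subtract the mean from each value
    squares = [d ** 2 for d in diffs]       # 3. square the differences
    variance = sum(squares) / len(squares)  # 4. take the mean of the squares
    return math.sqrt(variance)              # 5. take the square root

sigma = standard_deviation(data)
print(sigma)         # 2.0
print(pstdev(data))  # the standard library's population version agrees
```

Note that this is the population form, which divides by $n$ to match the steps above. Python's `statistics.stdev` (without the `p`) divides by $n - 1$ instead, so it is worth checking which version a given built-in function computes.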

In an era of mass computing, it seems strange to focus on computation over understanding. There is no tool that I know of for working with data that doesn't include a built-in function for computing the Standard Deviation. This algorithm-first approach provides no greater mathematical insight into what Standard Deviation means than simply saying "standard deviation can be calculated in Excel in the following manner...".

The next step is often to show the equation version of this algorithm. The famous $\sigma$ (sigma) enters the picture in this new definition:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$$

Due to $\sigma$'s dominance in introductory statistics, Variance is relegated to the identity $\sigma^2$, and of course $\sigma^2$ is just the above equation without the square root:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$

This definition of Variance is nearly an exact translation of our algorithm into a mathematical equation. But there is one remarkably interesting insight here: this equation has nearly the same structure as the first formal version of Expectation we discussed, with $(x_i - \mu)^2$ in place of $x_i$! The integral version of Variance shares the same similarity with the integral definition of Expectation:

$$\sigma^2 = \int (x-\mu)^2f(x)dx$$

This similarity between Variance and Expectation can be expressed more generally and more directly as:

$$Var(X) = E[(X-\mu)^2]$$
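Read this way, Variance is just an Expectation of a transformed value. A minimal sketch for a discrete distribution follows; the fair die is an illustrative assumption, and `expectation` with its optional `g` argument is a helper invented for this example.

```python
from fractions import Fraction

def expectation(values, probs, g=lambda x: x):
    """E[g(X)] for a discrete distribution given as values and probabilities."""
    return sum(g(x) * p for x, p in zip(values, probs))

# A fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

mu = expectation(values, probs)
# Variance is the expectation of the squared distance from the mean.
var = expectation(values, probs, lambda x: (x - mu) ** 2)
print(mu, var)  # 7/2 35/12
```

The same `expectation` function computes both quantities; only the function applied to each value changes.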

We could stop here, but for me there is still a big problem with this definition. It is still the "computation first" approach. We have pulled out some interesting things along the way but this equation's pedagogical history as being just an algorithm is still clear. When I look at this equation, I still have no idea what Variance is actually doing. Through a few steps we can transform it one last time:
$$Var(X) = E[(X - E[X])^2]$$
$$Var(X) = E[X^2 - 2X E[X] + (E[X])^2]$$
$$Var(X) = E[X^2] - 2E[X]E[X] + (E[X])^2$$
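The third line follows from linearity of expectation together with the fact that $E[X]$ is a constant, so $E[2X E[X]] = 2E[X]E[X]$ and $E[(E[X])^2] = (E[X])^2$. Combining the last two terms, we finally arrive at...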
$$Var(X) = E[X^2] - (E[X])^2$$

This can plainly be understood as "Variance is a measurement of what happens when we compare squaring the input to Expectation with squaring the output."
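As a quick sanity check, here is a sketch comparing the two forms on a small made-up data set, treating each data point as equally likely. The helper name `expect` is an assumption for this example.

```python
# Hypothetical data points, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

def expect(xs, g=lambda x: x):
    """Empirical expectation of g(X): the mean of g over the data."""
    return sum(g(x) for x in xs) / len(xs)

mean_of_squares = expect(data, lambda x: x ** 2)  # square the input: E[X^2]
square_of_mean = expect(data) ** 2                # square the output: (E[X])^2
shortcut = mean_of_squares - square_of_mean

# The original definition, E[(X - E[X])^2], for comparison.
definition = expect(data, lambda x: (x - expect(data)) ** 2)

print(mean_of_squares, square_of_mean)  # 29.0 25.0
print(shortcut, definition)             # 4.0 4.0
```

Squaring before averaging gives 29, squaring after averaging gives 25, and the gap of 4 between them is the Variance.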

Conclusion

In order to effectively make use of computation, you need to understand what the computation means. In many cases the details of computation are irrelevant (unless you are actually doing the work of implementing/optimizing these methods). Much of the confusion in the teaching of statistics comes from the fact that we often train people to be computers. In my view, it would be much better if everyone were mathematicians (and realized that this is not such a scary thing) and let computers worry about the computing.

If you enjoyed this post please subscribe to keep up to date and follow @willkurt!