Bayesian A/B Testing: A Hypothesis Test that Makes Sense

This post is part of our Guide to Bayesian Statistics

We've covered the basics of Parameter Estimation pretty well at this point. We've seen how to use the PDF, CDF and Quantile function to learn the likelihood of certain values, and we've seen how we can add a Bayesian prior to our estimate. Now we want to use our estimates to compare two unknown parameters.

Keeping with our email example, we are going to set up an A/B Test. We want to send out a new email and see if adding an image to the email helps or hurts the conversion rate. Normally the weekly email includes some image. For our test we're going to send one Variant with the image, like we always do, and another without the image. The test is called an A/B Test because we are comparing Variant A (with image) and Variant B (without).

We'll assume at this point we have 600 subscribers. Because we want to exploit the knowledge gained during our experiment, we're only going to run our test on 300 of these subscribers; that way we can give the remaining 300 what we believe to be the best variant. The 300 people we're going to test will be split up into two groups, A and B. Group A will receive an email like we always send, with a big picture at the top, and group B's email will not have the picture.

Next we need to figure out what prior probability we are going to use. We've run an email campaign every week so we have a reasonable expectation that the probability of the recipient clicking the link to the blog on any given email should be around 30%. To make things simple we'll use the same prior for A and B. We'll also choose a pretty weak version of our prior because we don't really know how well we expect B to do, and this is a new email campaign so maybe other factors would cause a better or worse conversion anyway. We'll settle on Beta(3,7):

Different Beta distributions can represent varying strengths in belief in known priors
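
To get a feel for what "weak" means here, the sketch below compares Beta(3,7) with two stronger priors that share the same 30% mean. Beta(30,70) and Beta(300,700) are hypothetical choices made up for this illustration; only Beta(3,7) is actually used in our test.

```r
# Three priors centered on a 30% conversion rate.
# Only "weak" (Beta(3,7)) comes from the post; the other two are hypothetical.
priors <- list(weak = c(3, 7), medium = c(30, 70), strong = c(300, 700))

# Mean and central 95% interval for each prior
prior.summary <- t(sapply(priors, function(p) {
  c(mean  = p[1] / (p[1] + p[2]),
    lower = qbeta(0.025, p[1], p[2]),
    upper = qbeta(0.975, p[1], p[2]))
}))
print(round(prior.summary, 3))
```

The means are identical, but the interval narrows as the parameters grow, which is what a stronger belief looks like.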

Next we need our actual data. We send out our emails and get these responses:

Variant A: 36 clicked, 114 did not click
Variant B: 50 clicked, 100 did not click

Our observed evidence

Given what we already know about parameter estimation, we can treat the two variants as two different parameters we're trying to estimate. Variant A is going to be represented by Beta(36+3,114+7) and Variant B by Beta(50+3,100+7) (if you're confused by the +3 and +7, they are our Prior, which you can refresh on in the post on Han Solo). Here we can see the estimates for each parameter side by side:

The overlap between the distributions is what we care about.
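
The update behind those two distributions is just addition, and a short sketch makes that concrete. It assumes, as the Beta parameters imply, that 36 and 50 are clicks and 114 and 100 are non-clicks; the variable names and the `summarize` helper are mine, not from an existing package.

```r
prior.alpha <- 3
prior.beta  <- 7

# Observed data, as implied by the Beta parameters in the text
a.clicks <- 36; a.no.clicks <- 114
b.clicks <- 50; b.no.clicks <- 100

# Conjugate update: add clicks to alpha and non-clicks to beta
a.post <- c(alpha = a.clicks + prior.alpha, beta = a.no.clicks + prior.beta)
b.post <- c(alpha = b.clicks + prior.alpha, beta = b.no.clicks + prior.beta)

# Posterior mean and central 95% interval (hypothetical helper)
summarize <- function(post) {
  a <- post[["alpha"]]
  b <- post[["beta"]]
  c(mean = a / (a + b), lower = qbeta(0.025, a, b), upper = qbeta(0.975, a, b))
}
print(round(rbind(A = summarize(a.post), B = summarize(b.post)), 3))
```

If the two intervals overlap, that is the same overlap we see in the plot.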

Clearly our data suggests that Variant B is the superior variant. However, from our earlier discussion on Parameter Estimation we know that the true conversion rate can be a range of possible values. We can also clearly see here that there is an overlap between the possible true conversion rates for A and B. What if we got unlucky in our A responses and A's true conversion rate is in fact much higher? What if we were also really lucky with B and its conversion rate is in fact much lower? If both of these conditions held, it is easy to see a possible world in which A is the better variant even though it did worse on our test. The real question is: how likely is it that B is actually the better variant?

Monte Carlo to the Rescue!

I've mentioned before that I'm a huge fan of Monte Carlo Simulations, so we're going to tackle this question using a Monte Carlo Simulation. R has an rbeta function that allows us to sample from a Beta distribution. We can now literally ask, by simulation, "What is the probability that B is actually superior to A?" We'll simply sample 100,000 times from each distribution we have modeled here and see what it tells us:

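# Number of samples to draw from each posterior
n.trials <- 100000
# Beta(3,7) prior shared by both variants
prior.alpha <- 3
prior.beta <- 7
# Posterior samples: clicks + prior.alpha, non-clicks + prior.beta
a.samples <- rbeta(n.trials,36+prior.alpha,114+prior.beta)
b.samples <- rbeta(n.trials,50+prior.alpha,100+prior.beta)
# Fraction of simulated worlds in which B beats A
p.b_superior <- sum(b.samples > a.samples)/n.trials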

We end up with:

p.b_superior = 0.96
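
As a sanity check on the simulation, the same probability can be computed by numerical integration instead of sampling. This is a sketch of an alternative route, not something the simulation needs: it uses the identity P(B > A) = ∫ f_B(x) F_A(x) dx over [0, 1] together with R's built-in `integrate`, and the variable names are mine.

```r
# Posterior parameters from the text
a.alpha <- 36 + 3; a.beta <- 114 + 7
b.alpha <- 50 + 3; b.beta <- 100 + 7

# P(B > A): density of B at x times the probability that A falls below x
integrand <- function(x) {
  dbeta(x, b.alpha, b.beta) * pbeta(x, a.alpha, a.beta)
}
p.b_superior.exact <- integrate(integrand, lower = 0, upper = 1)$value
print(p.b_superior.exact)
```

The integral and the simulation estimate the same quantity, the probability that B beats A, so the two should agree closely.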

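This is roughly analogous to getting a p-value of 0.04 from a one-tailed t-test, although the two numbers answer different questions: a p-value is the probability of data at least this extreme if there were no difference, while our result is the probability that B is better given the data. In terms of classical statistics, we would be able to call this result "Statistically Significant"! So why didn't we just use a t-test then? For starters I'm willing to bet these few lines of code are dramatically more intuitive to understand than Student's t-distribution. But there's actually a much better reason.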

Magnitude is more important than Significance

The focus of classic Null Hypothesis Significance Testing (NHST) is to establish whether two different samples are likely to be the result of sampling from the same distribution or not. Statistical Significance can at most tell us "these two things are not likely the same" (this is what rejecting the Null Hypothesis is saying). That's not really a great answer for an A/B Test. We're running this test because we want to improve conversions. Results that say "Variant B will probably do better" are okay, but don't you really want to know how much better? Classical statistics tells us Significance, but what we're really after is Magnitude!

This is the real power of our Monte Carlo Simulation. We can take the exact results from our last simulation and now look at how much of an improvement Variant B is likely to be. Now we'll simply plot the ratio \(\frac{\text{B Samples}}{\text{A Samples}}\). This will give us a distribution of the relative improvements we've seen in our simulations.

This histogram describes all the possible differences between A and B
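
A minimal sketch of how a histogram like this could be produced. The samples are redrawn so the snippet runs on its own; the seed, the number of breaks, the labels and the name `ratio.samples` are arbitrary choices of mine.

```r
set.seed(1234)  # arbitrary seed, only for reproducibility
n.trials <- 100000
a.samples <- rbeta(n.trials, 36 + 3, 114 + 7)
b.samples <- rbeta(n.trials, 50 + 3, 100 + 7)

# Relative performance of B versus A in each simulated world
ratio.samples <- b.samples / a.samples

hist(ratio.samples, breaks = 50,
     main = "Simulated ratio of B to A",
     xlab = "B conversion rate / A conversion rate")
```

A ratio of 1 means the two variants convert equally well, and anything above 1 is a world in which B wins.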

From this histogram we can see that our most likely case is about a 40% improvement over A, but we can see an entire range of values. As we discussed in our first post on Parameter Estimation, the Cumulative Distribution Function (CDF) is much more useful for reasoning about our results.

The line here represents the median improvement seen in the simulation
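
One way to draw a plot like this is with R's `ecdf`, which builds an empirical CDF from the simulated ratios. This is a sketch with a freshly drawn set of samples; the seed, labels and names such as `ratio.cdf` are my own choices.

```r
set.seed(1234)  # arbitrary seed, only for reproducibility
n.trials <- 100000
a.samples <- rbeta(n.trials, 36 + 3, 114 + 7)
b.samples <- rbeta(n.trials, 50 + 3, 100 + 7)
ratio.samples <- b.samples / a.samples

# Empirical CDF: ratio.cdf(x) is the share of simulations with ratio <= x
ratio.cdf <- ecdf(ratio.samples)
median.ratio <- median(ratio.samples)

plot(ratio.cdf,
     main = "Empirical CDF of the ratio of B to A",
     xlab = "B conversion rate / A conversion rate",
     ylab = "Cumulative probability")
abline(v = median.ratio, lty = 2)  # dashed line at the median improvement
```

Because `ratio.cdf` is an ordinary function, calling `ratio.cdf(1)` gives the simulated probability that A is at least as good as B.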

Now we can see that there is really just a small, small chance that A is better, but even if it is better it's not going to be better by much. We can also see that there's about a 25% chance that Variant B is a 50% or more improvement over A, and even a reasonable chance it could be more than double the conversion rate! Now in choosing B over A we can actually reason about our risk: "The chance that B is 20% worse is roughly the same as the chance that it's 100% better." Sounds like a good bet to me, and a much better statement of our knowledge than "There is a Statistically Significant chance that B is better than A."
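
Rather than reading these probabilities off the plot, we can also compute them directly from the simulated ratios. The sketch below redraws the samples so it runs on its own; the thresholds mirror the statements above, and the variable names and seed are mine. Exact values will vary slightly from run to run.

```r
set.seed(1234)  # arbitrary seed, only for reproducibility
n.trials <- 100000
a.samples <- rbeta(n.trials, 36 + 3, 114 + 7)
b.samples <- rbeta(n.trials, 50 + 3, 100 + 7)
ratio.samples <- b.samples / a.samples

# Share of simulated worlds falling in each region of interest
p.a.better <- mean(ratio.samples < 1)     # A beats B
p.gain.50  <- mean(ratio.samples >= 1.5)  # B at least 50% better
p.double   <- mean(ratio.samples >= 2)    # B at least 100% better
p.loss.20  <- mean(ratio.samples <= 0.8)  # B at least 20% worse

print(c(a.better = p.a.better, gain.50 = p.gain.50,
        double = p.double, loss.20 = p.loss.20))
```

Each line is just a proportion of samples, so any other threshold that matters for the decision can be checked the same way.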

Conclusion

There are many discussions of A/B Testing that you can find that would give dramatically different methodology than what we have done here. Orthodox Null Hypothesis Significance Testing differs in more ways than simply using a t-test, and will likely be the topic of a future post. The key insight here is that we have shown how the ideas of Hypothesis Testing and Parameter Estimation can be viewed, from a Bayesian perspective, as the same problem. Additionally I have found that there is no mystery in the approach outlined here. Every conclusion we draw is based on data (including our prior) and the basics of Probability. Through this and the other two posts we have built up a Hypothesis Testing framework entirely from first principles. I'll leave deriving Student's t-distribution as an exercise for the reader.

If you enjoyed this post please subscribe to keep up to date and follow @willkurt!