Replacing an A/B Test with GPT

A good chunk of my career has involved running, analyzing, and writing about A/B tests (here’s a quick Bayesian overview of A/B testing if you aren’t familiar). A/B tests are often considered to sit at the opposite, statistical end of the data science spectrum from AI and machine learning. Stepping back a bit, though, an A/B test just tells you the probability of an outcome (whether variant A is better than variant B), which is not that different from a deep neural network used for classification telling you the probability of a label.

With the rapid advances in Natural Language Processing (NLP), including easy access to pre-trained models from Hugging Face and the quite impressive results coming out of OpenAI’s Large Language Models (LLMs) like GPT-4, I was curious whether you could replace a subject line A/B test with a model built using GPT. It turns out that using GPT-3’s text embeddings and a very simple classification model, we can create a tool that correctly predicts the winner of an A/B test 87% of the time.

The Problem: Picking the Best Headline

The vast majority of the content on the web today exists to drive traffic to websites, ultimately to increase the probability that users will complete some conversion event. Conversion events can be anything from simply clicking a link to purchasing a product associated with the content. Even here at Count Bayesie, I ideally want people to at least read this content, even if I have nothing to sell.

In this post we’ll explore picking an article headline that generates the highest click rate for the post. For example, suppose we had these two ideas for headlines (which come from our data set):

A: When NASA Crunched The Numbers, They Uncovered An Incredible Story No One Could See.

B: This Stunning NASA Video Totally Changed What I Think About That Sky Up There.

Which of these two headlines do you think is better? A/B tests are designed to answer this question in a statistical manner: both headlines will be displayed randomly to users for a short time, and then we use statistics to determine which headline is more likely the better one.

While many consider a randomized controlled experiment (i.e. an A/B test) to be the gold standard for answering the question “which headline is better?”, there are a range of drawbacks to running them. The biggest one I’ve found professionally is that marketers hate waiting for results! In addition, they don’t want to spend some of the eyeballs that view the content on the experiment itself. If one variant is much worse, but it took you thousands of users to realize that, then you’ve wasted a potentially large number of valuable conversions.

This is why it would be cool if AI could predict the winner of an A/B test so that we don’t have to run them!

The Data

The biggest challenge with any machine learning problem is getting the data! While many, many companies run A/B tests frequently, very few publish the results of this (or even revisit their own data internally).

Thankfully in 2015 Upworthy published the Upworthy Research Archive which contained data for 32,487 online experiments.

Upworthy Research Archive data

32,487 experiments might not seem like all that much, but each experiment is often a comparison between not just two headlines but many. Here is an example of the data from a single experiment:

A single experiment involves the comparison of multiple variants.

We want to transform this dataset into rows where each row represents a comparison between a single pair of A and B variants. Using the combinations function from the itertools package in Python makes this very easy to calculate. Here’s an example of using this function to create all possible comparisons among 4 variants:

> from itertools import combinations
> for pair in combinations(["A","B","C","D"], 2):
>     print(pair)

('A', 'B')
('A', 'C')
('A', 'D')
('B', 'C')
('B', 'D')
('C', 'D')

After this transformation (plus some clean up) we have 104,604 examples in our train set and 10,196 examples in our test set out of the original 32,487 experiments.
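
To make this concrete, here is a rough sketch of how that pairwise expansion might look in pandas. The column names are illustrative (the real archive uses its own naming), so treat this as a sketch rather than the exact code behind the numbers above:

from itertools import combinations
import pandas as pd

def make_pairs(experiments: pd.DataFrame) -> pd.DataFrame:
    # Expand each experiment into one row per pair of variants.
    # Assumes columns `test_id`, `headline`, `clicks`, and `impressions`.
    rows = []
    for test_id, group in experiments.groupby("test_id"):
        variants = group.to_dict("records")
        for a, b in combinations(variants, 2):
            rows.append({
                "test_id": test_id,
                "headline_a": a["headline"],
                "headline_b": b["headline"],
                "clicks_a": a["clicks"],
                "impressions_a": a["impressions"],
                "clicks_b": b["clicks"],
                "impressions_b": b["impressions"],
            })
    return pd.DataFrame(rows)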

This leads us to a very interesting problem regarding this data: what exactly are our labels and how do we split the data up into train and test?

The tricky part of labels and train/test split

It’s worth recalling that the entire reason we run an A/B test in the first place is we don’t know which variant is better. The reason statistics is so important in this process is we are never (or at least rarely) 100% certain of the result. At best, if we view these experiments in a Bayesian way, we only end up with the probability A is better than B.
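
For example, a minimal Bayesian way to get that probability for a single pair is to draw samples from a Beta posterior over each variant’s click rate and count how often A comes out ahead. This sketch assumes uniform Beta(1, 1) priors and is not necessarily the exact calculation used for the labels here:

import numpy as np

def prob_a_beats_b(clicks_a, imps_a, clicks_b, imps_b,
                   n_samples=100_000, seed=1337):
    # Monte Carlo estimate of P(rate_A > rate_B) under Beta(1, 1) priors
    rng = np.random.default_rng(seed)
    rate_a = rng.beta(clicks_a + 1, imps_a - clicks_a + 1, n_samples)
    rate_b = rng.beta(clicks_b + 1, imps_b - clicks_b + 1, n_samples)
    return (rate_a > rate_b).mean()

# e.g. prob_a_beats_b(120, 10_000, 100, 10_000) comes out to roughly 0.9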

In most classification problems we assume binary labels, but when we transform our data the labels look like this:

If we represent our labels honestly, they are probabilistic labels, which makes our problem a bit different than usual.

As we’ll see in a bit, there’s a very simple way we can use logistic regression to learn from uncertain labels in training.

For our test set, however, I really want to get a sense of how this model might perform on clear-cut cases. After all, if the difference between two headlines is negligible, neither this model nor an A/B test would help us choose. What I do care about is that if an A/B test can detect a difference, our model does as well.

To ensure that our test set can be labeled accurately, we only chose pairs where the result was known with a high degree of certainty (i.e. the probability that A beats B is very close to 0 or very close to 1). To make sure there was no data leakage, all of the titles that appear in the test set are removed from the training set.
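
Here is a minimal sketch of that split, assuming the pairwise DataFrame from earlier has a p_a_gte_b column holding P(A beats B); the cutoff below is illustrative, not the exact one used:

CUTOFF = 0.01  # keep only near-certain outcomes for the test set

is_confident = ((pairs_df["p_a_gte_b"] < CUTOFF) |
                (pairs_df["p_a_gte_b"] > 1 - CUTOFF))
test_df = pairs_df[is_confident].copy()
test_df["label"] = (test_df["p_a_gte_b"] > 0.5).astype(int)

# Avoid leakage: drop any training pair that shares a headline with the test set
test_headlines = set(test_df["headline_a"]) | set(test_df["headline_b"])
train_df = pairs_df[~is_confident]
train_df = train_df[~(train_df["headline_a"].isin(test_headlines) |
                      train_df["headline_b"].isin(test_headlines))]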

The Model

The model for this problem is both relatively simple and quite different from most other classification models, which is a major reason I was so fascinated by this project. The simplicity is intentional, even though it’s not hard to imagine modifications that could lead to major improvements: I’m primarily interested in testing the effectiveness of the language models rather than of this problem-specific model.

Here is a diagram of the basic structure of our model.

The basic flow of our model, the key insight is computing the difference of the vector representations.

We’ll walk through each section of the model process touching on quite a few interesting things going on in what is otherwise a pretty simple model to understand.

Embeddings: Bag-of-Words, DistilBERT, GPT-3 (ada-002)

The heart of this model is its embeddings: how we transform our text into a vector of numeric values so that we can represent headlines mathematically. We are going to use three different approaches:

  • a Bag-of-Words representation built with SKLearn

  • DistilBERT embeddings from Hugging Face’s 🤗 Transformers library

  • GPT-3 embeddings (ada-002) from OpenAI’s API

Each of these techniques is the only difference between the three A/B test prediction models we’re going to build. Let’s step through the basic construction of each:

Bag-of-Words

A “bag of words” vector representation treats each headline merely as a collection of words drawn from the vocabulary of the training set. Each term present in the headline (in this case both single words and two-word sequences called bi-grams) results in a value of 1 in the vector representing the headline; every term in the vocabulary not present in the headline gets a 0. Our text can easily be transformed this way with SKLearn as follows:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def build_bow_vectorizer(df):
    # Build the vocabulary from both headline columns
    corpus = np.concatenate([df['headline_a'].values,
                             df['headline_b'].values])
    # Binary uni-gram/bi-gram indicators, dropping very common
    # and very rare terms
    vectorizer = CountVectorizer(ngram_range=(1,2),
                                 binary=True,
                                 max_df=0.6,
                                 min_df=0.005)
    vectorizer.fit(corpus)
    return vectorizer
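
As a quick usage sketch (assuming a train_df with the headline_a and headline_b columns used above), the fitted vectorizer turns each headline into a vector, and each pair into the difference of its two vectors, which is the representation discussed in the modeling section below:

# Fit on the training pairs, then represent each pair as the difference
# of its two bag-of-words vectors (see "Modeling the difference" below)
vectorizer = build_bow_vectorizer(train_df)
X_a = vectorizer.transform(train_df["headline_a"]).toarray()
X_b = vectorizer.transform(train_df["headline_b"]).toarray()
X_train = X_a - X_b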

Typically when we refer to an “embedding” we don’t mean just any vector representation, but specifically one that is the output of the last layer of a neural network trained to model language (often for some other task). Technically, our BoW representation would not be considered a true embedding.

🤗 Transformers library and DistilBERT

Hugging Face’s Transformers library is a powerful tool that allows us to use pre-trained language models to create embeddings. This is very important because it lets us leverage the power of large language models trained on an enormous corpus of text to make our headline representations very information rich. Using an existing model to build a task-specific model is referred to as transfer learning, and it is a major revolution in what is possible with machine learning.

What Hugging Face allows us to do is run our text through an existing neural network (specifically a transformer) and retrieve the activations of the last hidden state of the model, then use these as our embeddings. The process is a bit more involved than our BoW encoding, but here is an example function for extracting the hidden states (adapted from Natural Language Processing with Transformers):

def extract_hidden_states(batch):
    # Place Model inputs on the GPU
    inputs_a = {k:v.to(device) for k, v in batch.items()
                if k in ['input_ids_a', 'attention_mask_a']}
    inputs_a['input_ids'] = inputs_a.pop('input_ids_a')
    inputs_a['attention_mask'] = inputs_a.pop('attention_mask_a')
    
    inputs_b = {k:v.to(device) for k, v in batch.items()
                if k in ['input_ids_b', 'attention_mask_b']}
    inputs_b['input_ids'] = inputs_b.pop('input_ids_b')
    inputs_b['attention_mask'] = inputs_b.pop('attention_mask_b')
    # Extract last hidden states
    with torch.no_grad():
        last_hidden_state_a = model(**inputs_a).last_hidden_state
        last_hidden_state_b = model(**inputs_b).last_hidden_state
        
    return {"hidden_state_a": last_hidden_state_a[:,0].cpu().numpy(),
            "hidden_state_b": last_hidden_state_b[:,0].cpu().numpy()
           }

The specific model we’re using is a version of the DistilBERT transformer, which is a very capable language model, though not nearly as large and powerful as GPT-3.
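
The extract_hidden_states function above assumes a tokenizer, model, and device have already been set up. A minimal sketch of that setup, assuming the distilbert-base-uncased checkpoint (the exact checkpoint is my assumption), might look like this:

import torch
from transformers import AutoTokenizer, AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_ckpt = "distilbert-base-uncased"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt).to(device)

def tokenize_pair(batch):
    # Tokenize both headlines, keeping the _a/_b suffixes that
    # extract_hidden_states expects
    tokens_a = tokenizer(batch["headline_a"], padding=True, truncation=True)
    tokens_b = tokenizer(batch["headline_b"], padding=True, truncation=True)
    return {"input_ids_a": tokens_a["input_ids"],
            "attention_mask_a": tokens_a["attention_mask"],
            "input_ids_b": tokens_b["input_ids"],
            "attention_mask_b": tokens_b["attention_mask"]}

With a Hugging Face Dataset, these two steps would typically each be applied with dataset.map(..., batched=True), setting the tokenized columns to "torch" format before extracting the hidden states so the batches arrive as tensors.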

GPT-3 using OpenAI’s API

Our last set of embeddings comes from OpenAI’s GPT-3, fetched through their API. GPT-3 is a remarkably powerful transformer that has been in the news so much it’s hard to imagine anyone has not already heard too much about it! Not only is the model powerful, but the API is remarkably simple to use in Python. Here is an example of fetching embeddings for two headlines:

resp = openai.Embedding.create(input=[headline_a, headline_b],
                               model=EMBEDDING_MODEL)
embedding_a = np.array(resp['data'][0]['embedding'])
embedding_b = np.array(resp['data'][1]['embedding'])

The catch, of course, for all this power and ease of use is that it’s not free. That said, my total bill for running this model and some other experiments ended up being under a dollar! Making sure that I was caching and saving my results, to avoid being billed twice for the same task, did add a bit of code complexity, but of the three embedding solutions this was still the easiest to implement.
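
For the curious, the caching doesn’t need to be anything fancy. A minimal sketch (the file name and helper function are hypothetical, not the exact code used here) could be as simple as a dictionary persisted to disk:

import json
import os

import numpy as np
import openai

EMBEDDING_MODEL = "text-embedding-ada-002"
CACHE_PATH = "embedding_cache.json"   # hypothetical cache file

# Load any embeddings we've already paid for
cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}

def get_embedding_cached(text):
    # Only call the API on a cache miss, then persist the result
    if text not in cache:
        resp = openai.Embedding.create(input=[text], model=EMBEDDING_MODEL)
        cache[text] = resp["data"][0]["embedding"]
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return np.array(cache[text])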

It is worth pointing out that we’re not prompting GPT-3 with questions about our headlines but using embeddings that are derived from it. This is an important use case for these powerful models that I currently haven’t seen discussed too much in the vast floods of articles on the topic.

Modeling the difference between two headlines

Now that we have a way to represent all of our headlines as vectors we have a new modeling problem: How are we going to represent the difference between these two headlines?

We could concatenate them and let the model worry about this, but my goal here is to understand the impact of the embeddings alone, not to worry about a more sophisticated model. Instead we can solve this the way that many models handle comparisons: use the literal difference between the two vectors.

By subtracting the vector representing headline B from the vector representing headline A, we get a new vector representing how these headlines differ, and we use that as the final vector representation for our model.

To understand how this works consider this simplified example:
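
Below is a hypothetical ex_df for this example; the feature values are made up purely for illustration:

import pandas as pd

# Two headlines described by three hand-picked features (values invented)
ex_df = pd.DataFrame(
    {"word_count":   [8, 12],
     "has_emojis":   [0, 1],
     "ends_exclaim": [1, 1]},
    index=[0, 1])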

Here we have headlines 0 and 1, each represented by a very simple vector consisting of three features: the word count, whether or not the headline contains emojis, and whether or not the headline ends with an exclamation mark. Now let’s see what the result of subtracting these vectors is:

> ex_df.iloc[0,:].values - ex_df.iloc[1,:].values
array([-4, -1,  0])

We can interpret the resulting vector as:

  • headline 0 is four words shorter than headline 1

  • headline 0 does not have emojis and headline 1 does

  • headline 0 and headline 1 either both don’t or both do end in exclamation marks.

In this case a model might learn that emojis are good, so headline 0 would be penalized because it does not have them.

Of course our representations are much more complex, however the intuition behind modeling the difference remains the same.

Our classifier: Logistic regression as regression

Despite some fairly noteworthy educators making the claim that “logistic regression is a model for classification, not regression”, the model we’ll end up using demonstrates both that logistic regression quite literally is regression and that the distinction between “classification” and “regression” is fairly arbitrary.

Thinking about our problem, it seems perfectly suited to logistic regression: after all, we just want to predict the probability of a binary outcome. However, if we try this in SKLearn we run into an interesting problem:

> from sklearn.linear_model import LogisticRegression

> base_model = LogisticRegression().fit(X_train, y_train)

ValueError Traceback (most recent call last)...

ValueError: Unknown label type: 'continuous'

SKLearn shares the same assumption that many others in the machine learning community have: that logistic regression can only predict binary outcomes. There is, however, a reason this restriction is so common. When we explored logistic regression on this blog, we discussed how it can be viewed as a mapping of Bayes’ Theorem onto the standard linear model. We focused on the model in this form:

$$O(H|D) = \frac{P(D|H)}{P(D|\bar{H})}O(H)$$

Which can be understood in terms of a linear model and the logit function as:

$$\text{logit}(y) = x\beta_1 + \beta_0$$

However, this formulation usually can’t be used directly, precisely because the labels are typically exactly 1.0 and 0.0, values for which the logit function is undefined. So instead standard implementations use a formula based on an alternate (and much more common) view of logistic regression:

$$y = \text{logistic}(x\beta_1 + \beta_0)$$

Our labels, however, are probabilities that are (essentially) never exactly 0 or 1, so the logit is always defined. By understanding the nature of logistic regression as regression, we can very easily implement a variation of logistic regression that does work for our data using logit and LinearRegression:

from scipy.special import logit
from sklearn.linear_model import LinearRegression

y_train = train_df["p_a_gte_b"].values
# We are going to perform Linear regression 
# on a y value transformed with the logit
target_train = logit(y_train)

base_model = LinearRegression().fit(X_train, target_train)

We just have to remember that our model will be outputting responses in terms of log-odds so we’ll need to transform them back to probabilities manually using the logistic function.
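
In code that’s just one extra step. Here is a quick sketch, assuming X_test holds the difference vectors for the test pairs and using SciPy’s expit as the logistic function:

from scipy.special import expit as logistic

# The model outputs log-odds; the logistic function maps them back
# to P(A beats B)
pred_log_odds = base_model.predict(X_test)
pred_probs = logistic(pred_log_odds)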

Results

Finally we can see how each of these different models performed! What’s interesting about this case is that our clever use of linear regression as logistic regression combined with the way we split up our test and train sets means we’ll have different ways to measure model performance depending on the data set used.

Model Performance

We transformed our probabilities in the training set into log odds and then ran them through a standard linear regression model. Because of this, we can simply use Mean Squared Error to compare model performance on the training set.

Mean Square Error on Train dataset

Smaller is better

While we can see a clear improvement for each progressively more powerful model, it is very difficult to interpret these results in any meaningful way. We can’t look at common classification metrics such as accuracy and ROC AUC since we don’t know the true labels for the train data set.

For the test set we can look at these scores, since the test set only consists of examples where we are highly confident in the result of the experiment. We’ll start by looking at the ROC AUC, which allows us to view the strength of our model without having to pick a particular cutoff for choosing one class or the other. For those unfamiliar, a score of 0.5 is performance on par with random guessing and a score of 1.0 represents perfect classification.

ROC AUC - Test set

Higher is better
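
For reference, the test-set metrics reported in this chart and the next are standard scikit-learn calls; this sketch assumes the pred_probs from the previous section and binary test labels y_test:

from sklearn.metrics import accuracy_score, roc_auc_score

# y_test holds the near-certain binary outcomes for the test pairs
auc = roc_auc_score(y_test, pred_probs)
acc = accuracy_score(y_test, (pred_probs > 0.5).astype(int))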

Here we can start seeing that these models are surprisingly good. Even the simplest model, the Bag of Words, has an ROC AUC of around 0.8, which means it is fairly good at predicting which headline will win an A/B test.

This brings up a point that I have found myself making repeatedly throughout my career in data science and machine learning: If a simple model cannot do well at solving a problem, it is extremely unlikely that a more complex model will magically perform much better.

There is a mistaken belief that if a simple model does poorly, the solution must be to add complexity. In modeling, complexity should be considered a penalty, and only accepted if simple models show some promise. As an example, many people believe that a simple model like logistic regression could not possibly do well on an image recognition problem like MNIST, when in fact a simple logistic model will score around 90% accuracy on the MNIST dataset.

When I first saw the BoW model doing well, I was already optimistic about GPT-3, which performs frankly fantastically in terms of ROC AUC. But now let’s look at what really matters in practice: accuracy!

Accuracy on Test set

These results are quite remarkable! It is worth noting that our test set essentially represents the cases where it is easiest to determine the winning variant (since we are more easily certain of the result when two variants are clearly different); however, it’s still impressive that our GPT-3 model is able to correctly predict the winner of an A/B test in 87% of these cases.

While impressive, it’s also important to consider that when we run an A/B test, “statistical significance” generally means being 95% sure that the difference is not zero (or, if we approach this as Bayesians, 95% sure one variant is superior), and these specific A/B tests were much more certain of their results than the model is on average.

Our best model still seems quite useful. Another way to explore this is to see how well calibrated our model’s probabilities are.

Probability calibrations

My diehard Bayesian readers might be a bit offended by my next question about these models, but I do want to know: “If you say you’re 80% confident, are you correct about 80% of the time?” While this is a particularly Frequentist interpretation of the model’s output probabilities, it does have a practical application. It’s quite possible for a model to have high accuracy but have all its predictions very close to 0.5, which makes it hard for us to know whether it is more sure about any of its predictions.

To answer this question I’ve plotted out the average accuracy for intervals of 0.05 probability. Here’s the result we get for our GPT3 model:

We want to see a “V” shape in our results because a 0.1 probability of winning reflects the same confidence as a 0.9

Notice that the ideal pattern here is a “V” shape. That’s because being only 10% sure that A is the winner is the same as being 90% sure that B is the winner. Our maximum state of uncertainty is 0.5.

As we can see, our GPT model is a bit under-confident in its claims. That is, when the model is roughly 80% sure that A will win, it turns out to be correct in calling A the winner closer to 95% of the time.
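
For anyone who wants to reproduce this kind of calibration plot, the underlying calculation is just binning. Here is a rough sketch, assuming the pred_probs and y_test used earlier:

import numpy as np
import pandas as pd

# Bin predictions into 0.05-wide intervals and compute, within each bin,
# how often the model's call (A if p > 0.5, otherwise B) was correct
calib_df = pd.DataFrame({"pred": pred_probs, "a_won": y_test})
calib_df["correct"] = (calib_df["pred"] > 0.5) == calib_df["a_won"].astype(bool)
calib_df["bucket"] = pd.cut(calib_df["pred"], bins=np.arange(0.0, 1.05, 0.05))
accuracy_by_bucket = calib_df.groupby("bucket")["correct"].mean()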

Demo: Choosing the headline for this post

While I’m not quite sure I’m ready to recommend using LLMs instead of running proper A/B tests, there are plenty of cases where one might want to run an A/B test but realistically cannot. This post is a great example! I don’t really have the resources (or the interest) to run a proper A/B test for the title of this post… so I figured I would give my model a shot!

My original plan for this post’s title was “Can GPT make A/B Testing Obsolete?”. I thought this sounded maybe a bit “click-baity”, so I compared it with the current title. Here’s the basic code for running an A/B test with the model:

def ab_test(headline_a, headline_b):
    # Fetch GPT-3 embeddings for both headlines in a single API call
    resp = openai.Embedding.create(input=[headline_a, headline_b],
                                   model=EMBEDDING_MODEL)
    embedding_a = np.array(resp['data'][0]['embedding'])
    embedding_b = np.array(resp['data'][1]['embedding'])
    # Represent the comparison as the difference of the two embeddings
    diff_vec = embedding_a - embedding_b
    # base_model predicts log-odds, so map back to P(A beats B)
    return logistic(base_model.predict([diff_vec]))

And the result of running this comparison turned out not so great for my original title:

> ab_test("Can GPT make A/B Testing Obsolete?",
>         "Replacing an A/B test with GPT")

array([0.2086494])

While not “statistically significant”, I also know that this model tends to underestimate its own confidence, so I went with the headline you see above.

Interestingly enough, when I fed the first part of this article to GPT-3 itself and told it to write its own headline, I got a remarkably similar one: "Replacing A/B Testing with AI: Can GPT-4 Predict the Best Headline?"

Running this through the model it seemed not to have a preference:

> ab_test("Replacing an A/B test with GPT",
>        "Replacing A/B Testing with AI: Can GPT-4 Predict the Best Headline?")

array([0.50322519])

Maybe you don’t really want to replace all of your A/B testing with LLMs, but, at least for this case, it was a good substitute!

Conclusion: Is an A/B test different than a Model?

I wouldn’t be surprised if many people who have spent much of their careers running A/B tests would read this headline and immediately think it was click-bait nonsense. This experiment, for me at least, does raise an interesting philosophical (and practical) question: What is the difference between a model that tells you there’s a 95% chance A is greater than B and an A/B test that tells you there’s a 95% chance A is greater than B? Especially if the former takes only milliseconds to run and the latter anywhere from hours to days. If your model is historically correct 95% of the time when it says 95%, how is this different from an A/B test making the same claim based on observed information?

Even though I’m very skeptical of big claims around true “AI” in these models, there’s no doubt that they do represent an unbelievable amount of information about the way we use language on the web. It’s not absurd to consider that GPT-3 (and beyond) has a valid understanding of how to represent these headlines in high-dimensional space such that a linear model is able to accurately predict how well they will perform on real humans.

The really fascinating proposition to me is that if we treat probabilities from a model the same as probabilities from an experiment, and the model takes only milliseconds to run, it dramatically changes the space in which A/B testing is possible. A generative model like GPT-4 could iterate on thousands of headlines, while a model like ours could run massive simulated “experiments” to find the best of the best.

While this may sound amazing to marketers and data scientists, it’s worth considering the effect this would have on the content we consume. Even if this did work, do you want to live in a world where every piece of content you consume is perfectly optimized to encourage you to consume it?
