A/B Testing

I. Implementation


A/B testing is a commonly used tool in technology companies to optimize website designs for clickthrough rates, determine optimal versions of an app, or generally test whether an intervention or change has an effect on some key metric. It is also known as split testing or bucket testing, but before its rebranding in tech, it was more commonly known as a randomized controlled trial (RCT).

In the 1920s, statistician and biologist Ronald Fisher wanted to know the impact of experiments such as putting fertilizer on a plot of land, and thus developed the first principles behind randomized controlled experiments. During his time, data was limited and statistical analysis was done by statisticians who knew how to interpret Fisher's p-value. Today, tech companies are not as limited by the quantity of data, nor do they need to wait for the end of a growing season to gather results, nor do they require highly trained statisticians to interpret results. This shift has introduced pitfalls in how basic RCT principles are applied, which can lead to a substantially higher false positive rate. In this blog post, we will cover how to draw conclusions from A/B testing in Part I, the do's and don'ts of implementing an experiment in Part II, and finally, the Bayesian approach to this problem in Part III.



To better understand A/B testing, we motivate the problem with an example. I want to increase clickthrough rates to a specific blog post that I've created, and I think I can do that by changing the image associated with this blog post.

To run this experiment, I split my audience so that some proportion views the variation and the rest view the control version. The key here is that both versions run simultaneously, so any time-related effect should impact both versions equally. The results are as follows:

| Observed Outcomes | Control | Variation |
| --- | --- | --- |
| Clicks | 50 | 96 |
| No-clicks | 150 | 224 |
| Total Views | 200 | 320 |

To perform an A/B test, I first set a p-value threshold. Ideally I would tie the threshold to a cost-benefit analysis of incorrectly choosing the variation over the control, or choose a threshold from prior knowledge within the domain. Here we will go with a hand-waving threshold of 5%. This means that when we perform our test of statistical significance using a chi-squared test or a two-tailed Z-test, even if the null hypothesis were true, the test statistic would land in the most extreme 5% of its distribution 5% of the time. In other words, we would falsely reject the null hypothesis 5% of the time (a false positive).
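To make the meaning of the 5% threshold concrete, here is a rough simulation sketch. It assumes both groups share the same true clickthrough rate (the value 0.28 and the group sizes of 200 and 320 are illustrative choices borrowed from the example), runs many such null experiments, and counts how often a two-tailed Z-test at the 5% level (critical value 1.96) would reject. The function names and the number of simulated experiments are hypothetical.

```python
import math
import random


def z_statistic(clicks_a, views_a, clicks_b, views_b):
    """Pooled two-proportion Z statistic."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    p_pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / views_a + 1 / views_b))
    # Guard against a degenerate sample with zero variance.
    return (p_a - p_b) / se if se > 0 else 0.0


def simulate_false_positive_rate(true_rate=0.28, views_a=200, views_b=320,
                                 n_experiments=2000, z_critical=1.96, seed=0):
    """Fraction of null experiments (no real difference) that get rejected."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_experiments):
        # Both groups draw clicks from the same true rate: the null is true.
        clicks_a = sum(rng.random() < true_rate for _ in range(views_a))
        clicks_b = sum(rng.random() < true_rate for _ in range(views_b))
        if abs(z_statistic(clicks_a, views_a, clicks_b, views_b)) > z_critical:
            false_positives += 1
    return false_positives / n_experiments


false_positive_rate = simulate_false_positive_rate()
print(f"False positive rate under the null: {false_positive_rate:.3f}")
```

The printed rate should land near 0.05, with some noise from the finite number of simulated experiments.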

(Note: We can use either a chi-squared test or a Z-test here. They yield different statistics but the same conclusion, since for a 2×2 table like ours the square of the Z statistic equals the chi-squared statistic.)

Having chosen the threshold, I can now calculate the chi-squared statistic based on the null hypothesis. So what is the null hypothesis here? It states that there is no difference between the clickthrough rate of our control group and that of the variation. Thus, if we find a statistically significant result, we can reject this claim and accept the alternative that our variation did indeed have an impact.

If there is no difference between the groups, we can assume that they both derive from the same distribution (that is, the frequency of clickthroughs is the same). To estimate that shared frequency, we sum the clicks across both groups and divide by the grand total of views; likewise, we sum the no-clicks across both groups and divide by the grand total of views for the no-click ratio.

| Expected Distribution | Value |
| --- | --- |
| Clicks Ratio | (50 + 96) / (200 + 320) = 0.28 |
| No-clicks Ratio | (150 + 224) / (200 + 320) = 0.72 |
| Total | 1 |

Since this is the expected distribution under the null hypothesis, both groups are assumed to have come from the same distribution. We can calculate the expected number of clickthroughs and no-clicks from both groups by multiplying the views in each respective group with the frequency of interest:

| Expected Outcomes | Control | Variation |
| --- | --- | --- |
| Clicks | 0.28 * 200 = 56 | 0.28 * 320 = 89.6 |
| No-clicks | 0.72 * 200 = 144 | 0.72 * 320 = 230.4 |
| Total Views | 200 | 320 |
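The pooled ratios and expected counts can be reproduced with a few lines of Python. This is only a sketch; the dictionary layout and variable names are my own choices. Note that it keeps the pooled ratio unrounded, so the expected counts differ slightly from the table, which uses the rounded values 0.28 and 0.72.

```python
# Observed counts from the experiment.
observed = {
    "control": {"clicks": 50, "no_clicks": 150},
    "variation": {"clicks": 96, "no_clicks": 224},
}

# Views per group and across both groups.
total_views = {group: sum(cells.values()) for group, cells in observed.items()}
grand_total = sum(total_views.values())

# Under the null hypothesis both groups share one click / no-click ratio.
pooled_ratio = {
    outcome: sum(observed[group][outcome] for group in observed) / grand_total
    for outcome in ("clicks", "no_clicks")
}

# Expected count = pooled ratio * number of views in that group.
expected = {
    group: {
        outcome: pooled_ratio[outcome] * total_views[group]
        for outcome in pooled_ratio
    }
    for group in observed
}

for outcome, ratio in pooled_ratio.items():
    print(f"pooled {outcome} ratio: {ratio:.4f}")
for group, cells in expected.items():
    print(group, {outcome: round(count, 2) for outcome, count in cells.items()})
```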

We can calculate the chi-squared statistic with the following formula, where $O_{i}$ is the value of each observed result, and $E_{i}$ is the respective expected value:

$$\chi ^{2}=\sum _{i=1}^{n}{\frac {(O_{i}-E_{i})^{2}}{E_{i}}}$$

$$\chi ^{2}={\frac {(50-56)^{2}}{56}} + {\frac {(150-144)^{2}}{144}} + {\frac {(96-89.6)^{2}}{89.6}} + {\frac {(224-230.4)^{2}}{230.4}}$$

$$\chi ^{2}=1.53$$
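As a quick check of the arithmetic, the sum can be evaluated directly. The sketch below computes it twice: once with the rounded expected counts used in this post, and once with unrounded expected counts derived from the observed totals. The helper name `chi_squared` is an illustrative choice.

```python
def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Cell order: control clicks, control no-clicks,
#             variation clicks, variation no-clicks.
observed = [50, 150, 96, 224]

# Expected counts as used in the text (pooled ratio rounded to 0.28 / 0.72).
expected_rounded = [56, 144, 89.6, 230.4]
chi2_rounded = chi_squared(observed, expected_rounded)

# Expected counts without rounding: group views * outcome total / grand total.
clicks_total, no_clicks_total, grand_total = 146, 374, 520
expected_exact = [
    200 * clicks_total / grand_total,
    200 * no_clicks_total / grand_total,
    320 * clicks_total / grand_total,
    320 * no_clicks_total / grand_total,
]
chi2_exact = chi_squared(observed, expected_exact)

print(f"chi-squared (rounded expected counts):   {chi2_rounded:.2f}")
print(f"chi-squared (unrounded expected counts): {chi2_exact:.2f}")
```

The two versions differ only in the second decimal place, so the rounding does not change the conclusion.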

Finally, we can go to a chi-squared lookup table with 1 degree of freedom (df), since a 2×2 table has (2 - 1) × (2 - 1) = 1, and a chi-squared statistic of 1.53. We find that the p-value for this statistic is 0.22, which means that the result is NOT significant at our threshold of 5%. Thus, at our chosen threshold, we did not find a significant difference in clickthrough rate when we changed the image.
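Instead of a lookup table, the p-value for 1 degree of freedom can be computed with the standard library alone. The sketch below relies on the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability is `erfc(sqrt(x / 2))`. This shortcut only holds for df = 1, and the function name is hypothetical.

```python
import math


def chi2_p_value_df1(chi2):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom.

    Only valid for df = 1, where P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2))


threshold = 0.05
p_value = chi2_p_value_df1(1.53)

print(f"p-value: {p_value:.2f}")
print("significant" if p_value < threshold else "not significant")
```

For other degrees of freedom, a statistics library would be the more natural choice.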

Note that we can come to the same result using a two-tailed Z-test. The chi-square test we performed doesn't specify direction of impact, therefore the equivalent Z-test should be the two-tailed test. As an exercise, use the following Z-statistic formulation to see if you can arrive at the same results:

$$z={\frac {\bar{X}-\mu }{SE}}$$

$$z={\frac {p_{0}-p_{1}} {\sqrt{ p_{pooled}(1-p_{pooled}) \left(\frac {1}{n_0} + \frac {1}{n_1}\right) }}}$$

where:

$p_{pooled} \equiv$ combined count of clicks for control and variation over total views

$p_{0} \equiv$ ratio of clicks for control group

$p_{1} \equiv$ ratio of clicks for variation group

$n_{0} \equiv$ number of views for control group

$n_{1} \equiv$ number of views for variation group
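For readers who want to check their answer to the exercise, here is one possible sketch of the two-tailed Z-test using only the standard library. The variable names mirror the symbols defined above. The two-tailed p-value uses the identity 2 × (1 − Φ(|z|)) = erfc(|z| / √2) for the standard normal distribution.

```python
import math

# Observed clicks and views for control (0) and variation (1).
clicks_0, n_0 = 50, 200
clicks_1, n_1 = 96, 320

p_0 = clicks_0 / n_0
p_1 = clicks_1 / n_1
p_pooled = (clicks_0 + clicks_1) / (n_0 + n_1)

# Standard error of the difference under the null hypothesis.
standard_error = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_0 + 1 / n_1))
z = (p_0 - p_1) / standard_error

# Two-tailed p-value from the standard normal distribution.
p_value_two_tailed = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.2f}")
print(f"z squared = {z ** 2:.2f}")
print(f"two-tailed p-value = {p_value_two_tailed:.2f}")
```

Squaring the Z statistic recovers the chi-squared statistic computed from unrounded expected counts, and the p-value again sits well above the 5% threshold.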