Suppose you have been running an A/B test for a week, and every day your business stakeholders ask, “How long are we planning to run the test? Do we have significance yet?” This is not an unusual situation; most product managers run into it sooner or later. The trouble is that many times we have no idea how long we should be running the test, so we keep looking at the results in the hope that we reach significance. The problem compounds if you are running a test where you expect no uplift. This could be because the change is being made for aesthetic reasons or for a revenue upside. How long should you run it? Tricky, isn’t it?
Ideally we should never start a test without knowing how many samples we are going to collect. Why? Otherwise you will keep looking at the data and end up doing ‘data peeking’, which is stopping the test as soon as you see significance. Here is an example: suppose you have a coin and your null hypothesis is that it is fair. How do you test that? Simple: decide to toss it 100 times. But what if you tossed it 10 times and saw tails all 10 times? It is tempting to stop right there and reject the null hypothesis that the coin is fair. What went wrong? You stopped the test early, at a point you had not planned for. Every extra look at the data is another chance for a random streak to cross the significance threshold, so stopping at the first significant result inflates the false positive rate beyond the level you chose. The other problem you run into if you have not calculated the sample size is that you won’t be able to say confidently how long you are going to run the test.
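To make the peeking problem concrete, here is a minimal simulation sketch in R. It assumes a genuinely fair coin, a planned 100 tosses, and a hypothetical habit of checking the p-value after every 10 tosses; the checkpoint schedule and the number of simulated experiments are illustrative choices, not something from a real test.

```r
# Compare "stop at the first significant peek" with "test once at the end"
# when the null hypothesis (a fair coin) is actually true.
set.seed(42)

n_sims      <- 5000                   # simulated experiments (illustrative)
n_tosses    <- 100                    # planned sample size
checkpoints <- seq(10, 100, by = 10)  # hypothetical peeking schedule
alpha       <- 0.05

# Exact two-sided binomial p-value against p = 0.5 (symmetric case),
# vectorised so we can evaluate all checkpoints at once.
two_sided_p <- function(heads, n) {
  pmin(1, 2 * pbinom(pmin(heads, n - heads), n, 0.5))
}

one_experiment <- function() {
  heads_so_far <- cumsum(rbinom(n_tosses, 1, 0.5))[checkpoints]
  p_values <- two_sided_p(heads_so_far, checkpoints)
  c(peeking = any(p_values < alpha),                   # stop at first "win"
    fixed   = p_values[length(p_values)] < alpha)      # single look at n = 100
}

results <- replicate(n_sims, one_experiment())
peeking_rate <- mean(results["peeking", ])
fixed_rate   <- mean(results["fixed", ])

cat("False positive rate with peeking:   ", peeking_rate, "\n")
cat("False positive rate with fixed size:", fixed_rate, "\n")
```

With the fixed sample size the false positive rate should stay at or below the chosen 5%, while the peeking strategy rejects the fair coin noticeably more often, even though nothing about the coin has changed.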
So how do we approach this?
Follow the first rule of product management — Embrace the ambiguity but avoid the uncertainty.
This is how we can approach calculating the sample size. Suppose we are running an A/B test where our current conversion rate for an event, such as the % of users signing up for email, is 20%, and we expect a 10% relative uplift in conversion if the treatment wins. Then,
Baseline conversion: P1 = 20%
Uplift in conversion: 10% (this is what you estimated as the expected impact of your change). As part of a growth team, we usually aim for a 20% uplift, but even 10% could be big depending on how mature your product is. The higher the uplift, the fewer samples you need and the sooner you reach significance.
Expected conversion of the treatment group: P2 = 20%*(1+10%) = 22%
Significance level: This is the chance of a false positive, i.e. the probability that we reject the null hypothesis when it is in reality true (which you would never know for certain). Of course, we want to keep this error small, so we choose 5%. If you have less traffic, you may want to increase this to 10% or even 20%, at the cost of more false positives.
False Positive: Type I error — Rejecting the null hypothesis when it is true
Statistical Power: This is the probability that you will not get a false negative. Power (= 1 - Type II error) is the probability of avoiding a Type II error, or in other words the probability that the test will detect a deviation from the null hypothesis, should such a deviation exist. Typically we set it to 80%.
False Negative : Type II error — Failing to reject the null hypothesis when it is false
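The inputs above can be collected in a few lines of R. This is only a sketch; the variable names are my own and the values are the ones from the example (20% baseline, 10% relative uplift, 5% significance, 80% power).

```r
# Inputs for the sample size calculation (names are illustrative).
baseline        <- 0.20   # P1: current conversion rate
relative_uplift <- 0.10   # expected relative improvement of the treatment

p1 <- baseline
p2 <- baseline * (1 + relative_uplift)  # P2: expected treatment conversion

alpha <- 0.05        # significance level = accepted Type I error rate
power <- 0.80        # probability of detecting the uplift if it exists
beta  <- 1 - power   # accepted Type II error rate

cat("P1 =", p1, " P2 =", p2, " alpha =", alpha, " beta =", beta, "\n")
```

Note that the uplift is relative: a 10% uplift on a 20% baseline means 22%, not 30%.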
Now we have everything we need to go ahead and calculate the sample size. We can use an online calculator, the G*Power tool, or R. Depending on which tool you use, you may see slightly different numbers, but that is okay.
Let us look at them one by one:
a) Online calculator such as this one here
b) G*Power tool: Download the tool from here. Choose Test family ‘Z tests’, set Statistical test to ‘Proportions: Difference between two independent proportions’, and enter P1, P2, Alpha (the significance level) and Power = 0.8.
c) R: The function that we are going to use is power.prop.test (man page).
power.prop.test(n = NULL, p1 = NULL, p2 = NULL, sig.level = 0.05, power = NULL, alternative = c("two.sided", "one.sided"), strict = FALSE)
power.prop.test(n = NULL, p1 = 0.2, p2 = 0.22, power = 0.8, alternative = 'two.sided', sig.level = 0.05)
This is the output that you will get in R
Two-sample comparison of proportions power calculation
n = 6509.467
p1 = 0.2
p2 = 0.22
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
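If you want to see where this number comes from, the sketch below computes the sample size by hand using the standard normal-approximation formula for comparing two proportions, and compares it with power.prop.test. It assumes the default two-sided test with strict = FALSE; the variable names are my own.

```r
# Closed-form sample size for a two-sided test of two proportions
# (normal approximation), compared with R's numerical solver.
p1    <- 0.20
p2    <- 0.22
alpha <- 0.05
power <- 0.80

z_alpha <- qnorm(1 - alpha / 2)  # critical value for the significance level
z_power <- qnorm(power)          # quantile matching the desired power
p_bar   <- (p1 + p2) / 2         # pooled proportion under the null

n_closed_form <- ((z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
                   z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) /
                  (p2 - p1))^2

n_solver <- power.prop.test(p1 = p1, p2 = p2, power = power,
                            sig.level = alpha)$n

cat("Closed form:    ", n_closed_form, "\n")
cat("power.prop.test:", n_solver, "\n")
```

The two values should agree up to the solver's tolerance, which is also why different tools can report slightly different numbers: they may use a different approximation or apply a continuity correction.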
This means that we would need about 6510 samples in each group, i.e. 13020 visitors in total across both groups.
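As a sanity check, we can simulate many A/B tests with 6510 users per group and count how often a pooled two-proportion z-test reaches significance. This is a rough sketch: the number of simulations and the choice of a plain z-test without continuity correction are my assumptions, not part of the calculation above.

```r
# Simulate A/B tests at the planned sample size and measure how often
# the test rejects the null hypothesis.
set.seed(1)

n_per_group <- 6510
n_sims      <- 5000                   # illustrative number of simulations
z_crit      <- qnorm(1 - 0.05 / 2)    # two-sided 5% critical value

reject_rate <- function(p_control, p_treatment) {
  conv_control   <- rbinom(n_sims, n_per_group, p_control)
  conv_treatment <- rbinom(n_sims, n_per_group, p_treatment)
  p_pool <- (conv_control + conv_treatment) / (2 * n_per_group)
  se     <- sqrt(p_pool * (1 - p_pool) * 2 / n_per_group)
  z      <- (conv_treatment - conv_control) / n_per_group / se
  mean(abs(z) > z_crit)
}

empirical_power     <- reject_rate(0.20, 0.22)  # uplift really exists
false_positive_rate <- reject_rate(0.20, 0.20)  # no real difference

cat("Empirical power:        ", empirical_power, "\n")
cat("Empirical false positive:", false_positive_rate, "\n")
```

If the sample size is right, the first number should land close to 80% and the second close to 5%.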
Now suppose you know historically that your website gets 2000 visitors per day. Then you know you have to run the test for 13020 / 2000 = 6.51 days, i.e. 7 days.
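To see how strongly the expected uplift drives the duration, here is a small sketch that repeats the calculation for a few uplift values. It assumes the 2000 visitors are per day, that every visitor enters the test, and that traffic is split evenly between the two groups; the 5% and 20% uplifts are just illustrative comparison points.

```r
# Required sample size and test duration for several expected uplifts.
baseline       <- 0.20
uplifts        <- c(0.05, 0.10, 0.20)  # relative uplifts to compare
daily_visitors <- 2000                 # assumed to all enter the test

n_raw <- sapply(uplifts, function(u) {
  power.prop.test(p1 = baseline, p2 = baseline * (1 + u),
                  power = 0.8, sig.level = 0.05)$n
})

plan <- data.frame(
  uplift      = uplifts,
  n_per_group = ceiling(n_raw),
  days        = ceiling(2 * ceiling(n_raw) / daily_visitors)
)
print(plan)
```

The smaller the uplift you want to detect, the faster the required sample size grows, so it pays to be honest about the expected impact before committing to a timeline.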
Bonus point: It is always a good idea to cover all days of the week, as most businesses have weekly seasonality in their demand pattern.
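One simple way to respect this is to round the duration up to whole weeks. The helpers below are hypothetical (the names are mine) and assume that all daily visitors enter the test.

```r
# Turn a per-group sample size into a run time that covers whole weeks.
days_needed <- function(n_per_group, daily_visitors, groups = 2) {
  ceiling(groups * n_per_group / daily_visitors)
}

round_up_to_full_weeks <- function(days) {
  ceiling(days / 7) * 7
}

days <- days_needed(6510, 2000)          # the example from this post
run_days <- round_up_to_full_weeks(days)

cat("Days needed:", days, " Run for:", run_days, "days\n")
```

In the example the answer is already exactly one week, but a test that needs, say, 10 days would be extended to 14 so that every weekday is covered equally.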
Next time you are about to run an A/B test, pre-calculate the sample size needed so that you can set the right expectations with your business stakeholders.
Just in case the sample size turns out so large that you don’t think you will reach significance given the traffic your website has, don’t worry: in another post I will share some cool tricks on how to run A/B tests when you do not have enough traffic. Until then, happy A/B testing.