Overview
A larger sample size is the fuel for higher statistical accuracy. In that regard, the sample size calculation ties together some of the most important statistical parameters of an A/B test, including the reliability of your winners and the power with which you can detect subtle effects in your campaigns. However, the underlying equation that governs this tradeoff takes some effort to understand because it combines many moving pieces. How to choose the different hyper-parameters is discussed in a separate post; this post is dedicated to explaining the underlying equation that makes it all work.
In essence, the equation for sample size calculation is a balancing equation between two possible hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis says that there is no actual difference between the control and the variation; the alternative hypothesis says that there is. There are two independent error rates (one for each hypothesis) that the sample size equation aims to balance. The first is the false positive rate (α, Type-1 error): the chance that a winner is declared even though the null hypothesis is true. The second is the false negative rate (β, Type-2 error): the chance that you miss out on declaring a winner when the alternative hypothesis is true.
Thinking in statistical terms, you can picture an improvement distribution, represented as a bell curve (see graphic), that summarizes your uncertainty after n samples are collected. This improvement distribution becomes narrower as more samples are collected, because you become more certain about the true improvement. Under the null hypothesis, the improvement distribution is centered at 0. Under the alternative hypothesis, it is centered at the minimum difference you want to detect from the test (a quantity called the Minimum Detectable Effect).
The final aim of the sample size equation is to reach a state where the improvement distributions for the null and alternative hypotheses are narrow enough that the error rates α and β fall below their defined limits. Various factors come into play in the final equation, but the underlying concept is simple. The equation has three components: the critical value constant that defines the cut-off, the width (variance) of the improvement distribution, and the minimum difference you want to detect.
$$n = \frac{C \cdot \sigma_d^2}{d^2}$$
NOTE: Each of these terms can take more complex or simpler expressions based on different contexts and use cases. However, what they represent always remains the same.
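One immediate consequence of this form is worth internalizing: because n scales with the square of σ_d/d, halving the minimum detectable effect quadruples the required sample size:

$$\frac{n_{d/2}}{n_d} = \frac{C \cdot \sigma_d^2 / (d/2)^2}{C \cdot \sigma_d^2 / d^2} = 4$$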
The three components of the sample size equation
Understanding these three components gives you a working intuition of how the sample size calculator works without having to delve into the intricate mathematical derivation of the formula.
- Critical Value Constant (C): The critical value constant balances the false positive rate (α) and the false negative rate (β) to arrive at a critical value of improvement (the dotted X line) that cleanly separates the two hypotheses (null and alternative). The constant is given by the formula below:
$$C = \left(\Phi^{-1}\left(1 - \frac{\alpha}{2}\right) + \Phi^{-1}\left(1 - \beta\right)\right)^2$$

where Φ⁻¹ = inverse cumulative distribution function of the standard normal (written here for a two-sided test; a one-sided test uses α in place of α/2)
Observe that the formula is built from the inverse cumulative distribution function (CDF) of the standard normal distribution. How exactly it converts the two error rates into the desired constant requires working through the derivation of the sample size equation and is beyond the scope of this article; a short computational sketch follows.
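As a minimal sketch of this computation (assuming the two-sided form above), SciPy's inverse normal CDF, norm.ppf, does the work:

```python
from scipy.stats import norm

def critical_value_constant(alpha: float, beta: float) -> float:
    """Critical value constant C for a two-sided test with false positive
    rate alpha and false negative rate beta. norm.ppf is the inverse CDF
    (quantile function) of the standard normal distribution."""
    return (norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)) ** 2

# alpha = 5%, beta = 20% (i.e., 80% power)
print(critical_value_constant(0.05, 0.20))  # ~7.85
```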
- Variance of the Improvement Distribution (σ_d²): The variance plays an important role in the whole equation because, fundamentally, statistical significance depends on resolving the uncertainty in your improvement estimate. In an A/B test, you are essentially trying to estimate the improvement between the control and the variation. According to sampling theory, you start with the variance in the data, and as you gather more samples, you keep reducing the uncertainty in your estimate. This uncertainty is captured by the standard error, whose square is proportional to variance/sample size. To achieve statistical significance, you need the standard error to be smaller than the effect size that you are trying to detect.
When this uncertainty drops below the desired effect size, you can be confident that the difference exists. The actual formula for the variance of the improvement distribution combines the variances of the baseline and the variation, weighted by the proportion of traffic allocated to each:
$$\sigma_d^2 = \frac{\sigma_B^2}{t_B} + \frac{\sigma_{v_i}^2}{t_{v_i}}$$

where σ² represents variance,
t_x represents the traffic allocation to group x,
B represents the baseline, and
v_i represents the variation
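Here is a minimal sketch of this computation in Python; the function name and the convention that traffic allocations are fractions of total traffic are illustrative assumptions:

```python
def improvement_variance(var_baseline: float, var_variation: float,
                         traffic_baseline: float, traffic_variation: float) -> float:
    """Variance of the improvement distribution for one baseline-vs-variation
    comparison. Traffic allocations are fractions of total traffic; sending
    less traffic to a group inflates its contribution to the variance."""
    return var_baseline / traffic_baseline + var_variation / traffic_variation

# Equal 50/50 split; a binary metric with a 10% mean has variance 0.1 * 0.9 = 0.09
print(improvement_variance(0.09, 0.09, 0.5, 0.5))  # 0.36
```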
- The Minimum Detectable Effect (d): Finally, the term in the denominator represents the minimum effect size that you wish to detect as significant in the test. As mentioned above, it is the ratio of the standard error to the minimum detectable effect that drives the sample size: the larger the effect you want to detect, the larger the standard error you can get away with; the smaller the effect you want to detect, the smaller the standard error you need.
The minimum effect size adjusted by ROPE can be represented as follows.
$$d = \mu \cdot (\mathrm{MDE} - \mathrm{ROPE})$$

where MDE = Minimum Detectable Effect,
ROPE = Region of Practical Equivalence, and
μ = Baseline Average
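Below is a small illustrative sketch. It assumes that both the MDE and the ROPE are expressed as fractions relative to the baseline average, matching the formula above; the function name and input conventions are hypothetical, so check how your own tool defines these quantities:

```python
def effective_effect_size(baseline_mean: float, mde: float, rope: float) -> float:
    """Absolute effect size to detect, assuming MDE and ROPE are both given
    as fractions relative to the baseline average (an assumption made here
    for illustration). The ROPE shrinks the effect you must resolve, since
    differences inside it are treated as practically equivalent to zero."""
    return baseline_mean * (mde - rope)

# 10% baseline conversion rate, 10% relative MDE, 1% relative ROPE
print(effective_effect_size(0.10, 0.10, 0.01))  # 0.009
```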
In essence, the three components come together to calculate the maximum sample size required to detect a minimum difference of MDE in the test.
The exact formula for sample size calculations is given below.
$$n = \frac{\left(\Phi^{-1}\left(1 - \frac{\alpha}{2}\right) + \Phi^{-1}\left(1 - \beta\right)\right)^2 \cdot \left(\frac{\sigma_B^2}{t_B} + \frac{\sigma_{v_i}^2}{t_{v_i}}\right)}{\left(\mu \cdot (\mathrm{MDE} - \mathrm{ROPE})\right)^2}$$

The detailed formula for sample size and duration calculations
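To tie the pieces together, here is a minimal end-to-end sketch in Python. It follows the reconstructed forms above (two-sided α; MDE and ROPE relative to the baseline average); the function and parameter names are illustrative, not VWO's actual API:

```python
import math
from scipy.stats import norm

def total_sample_size(alpha: float, beta: float,
                      var_baseline: float, var_variation: float,
                      traffic_baseline: float, traffic_variation: float,
                      baseline_mean: float, mde: float, rope: float = 0.0) -> int:
    """Total sample size across both groups, combining the three components:
    the critical value constant, the variance of the improvement
    distribution, and the ROPE-adjusted minimum detectable effect."""
    c = (norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)) ** 2   # critical value constant
    sigma_d2 = var_baseline / traffic_baseline + var_variation / traffic_variation
    d = baseline_mean * (mde - rope)                          # absolute effect to detect
    return math.ceil(c * sigma_d2 / d ** 2)

# Binary metric: 10% baseline conversion rate, so variance = 0.1 * 0.9 = 0.09.
# 50/50 traffic split, 10% relative MDE, no ROPE.
print(total_sample_size(0.05, 0.20, 0.09, 0.09, 0.5, 0.5, 0.10, 0.10))  # 28256
```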
The Difference Between Binary and Non-binary Metrics
The difference between binary and non-binary metrics is subtle and comes down to how their variance is expressed.
Non-binary metrics are metrics that can take real-numbered values such as 1, 2, 3, and so on. All non-binary random variables have two independent parameters that define them here. The first is the central tendency of the metric, represented by its average value. The second is the spread of the metric, represented by its variance. For instance, if you have a revenue metric with an average of $100, the variance of the metric is an independent parameter: typical values might spread from $90 to $110, or just as well from $70 to $130. For the same reason, the variance (or standard deviation) of all non-binary metrics must be estimated from the sampled data.
In contrast, binary metrics can only take the values 0 and 1; examples include conversion rate and bounce rate. Binary metrics are special because their variance is directly a function of their average value. Hence, every binary metric is fully defined by its mean alone. For instance, a conversion rate is defined entirely by its mean, such as 10% or 20%; no other parameter is needed.
However, the sample size calculation needs the variance, which for a binary metric follows directly from its average value: σ² = p(1 − p), where p is the metric's mean (e.g., the conversion rate). Once the variance is defined, the rest of the formula is the same for both kinds of metrics.
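For completeness, here is this relationship in one line of Python; a 10% conversion rate fully determines its own variance:

```python
def binary_variance(p: float) -> float:
    """Variance of a Bernoulli (binary 0/1) metric follows directly
    from its mean: Var = p * (1 - p)."""
    return p * (1 - p)

print(binary_variance(0.10))  # 0.09
```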
Conclusion
The sample size equation is a crucial formula that allows for greater customization of an A/B test. At VWO, we went through various stages of deriving this formula and strived to distill it into a concise form that can be used and understood by a wider community. Particularly challenging was extending the formula to include the Region of Practical Equivalence, continuous metrics, and unequal traffic allocation.
We have simulated millions of A/B tests with the desired properties to verify that the stated statistical guarantees hold at the calculated sample sizes.
In the future, we plan to release the derivation of the sample size formula and our simulation results to demonstrate the statistical guarantees it provides.