A statistical equation that balances desired accuracy against required sample size ties together several crucial hyperparameters of an experiment (i.e., the settings that control how and when decisions about a metric are made). An experienced experimenter can use this equation to flexibly configure the accuracy and duration of a test campaign. Let’s take a deep dive into these hyperparameters and learn how to use them effectively.

**Campaign Duration: Patience Yields Precision**

Statistics is used in A/B testing to determine whether the observed difference between the control and a variation is due to more than random chance. A fundamental principle behind statistical significance is that smaller differences require larger sample sizes to detect, whereas larger differences are detected quickly. This often sounds counterintuitive. To understand it better, consider the following example.

Noticeably, the human eye takes significantly less time to detect larger differences (A) and considerably longer to detect smaller ones (C). The same statistical intuition applies to an A/B test.

**Parameters of the Duration Calculator**

**Baseline Average**: This is the average of the primary metric on your baseline. It is learned automatically from the campaign data and serves as the starting point for defining the other parameters. Higher baseline averages translate to lower sample-size requirements because differences become easier to spot.

**For instance**, consider two cases where the average baseline revenue is $10 and $1 per visitor. A 10% uplift corresponds to $1 and 10 cents, respectively. Detecting the $1 uplift takes fewer samples than detecting the 10-cent uplift.
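
To make this intuition concrete, here is a minimal sketch using the classical normal-approximation sample-size formula for a two-sided test. This is a textbook frequentist approximation, not VWO’s actual engine, and the revenue standard deviation of $5 per visitor is an assumed illustrative value:

```python
from math import ceil
from statistics import NormalDist

def required_samples(delta, sigma, alpha=0.10, power=0.80):
    """Per-variation sample size for detecting an absolute uplift
    `delta` with a two-sided test (classical normal approximation;
    purely illustrative, not a vendor-specific calculation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance quantile
    z_beta = NormalDist().inv_cdf(power)           # power quantile
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Same relative uplift (10%) on two different baselines, with an
# assumed revenue standard deviation of $5 per visitor in both cases:
print(required_samples(delta=1.00, sigma=5.0))  # $10 baseline, $1 uplift
print(required_samples(delta=0.10, sigma=5.0))  # $1 baseline, 10-cent uplift
```

Because the required sample size scales with 1/delta², the 10-cent uplift needs roughly 100 times as many samples as the $1 uplift.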

**Daily Average Visitors**: This represents the average number of visitors that become part of your experiment in a day. It is used to estimate the duration of the test. VWO learns this parameter automatically from the campaign data.

**Minimum detectable effect (MDE)**: MDE is the smallest change that you want to detect in your experiment. It is the critical parameter that determines the duration of your test. As a rule of thumb, a smaller MDE leads to a longer test but detects more minute differences between variation and control (and vice versa).

**Region of practical equivalence (ROPE)**: ROPE refers to the range of improvements that can be considered negligible from a business perspective. Setting a larger ROPE helps you eliminate underperforming variations faster. As a tradeoff, however, it also increases the sample sizes needed to detect actual winning variations.

**NOTE:** Your campaign’s MDE should always be greater than the ROPE.
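
Once a required sample size is known, the daily-visitor figure turns it into an estimated duration. A rough sketch of that conversion, assuming traffic is split evenly across variations (the even split and the example numbers are assumptions, not VWO’s exact computation):

```python
from math import ceil

def estimated_duration_days(samples_per_variation, daily_visitors,
                            num_variations=2):
    """Rough test duration: total samples needed across all variations
    divided by the daily traffic entering the experiment.
    Illustrative only; assumes an even traffic split."""
    total_samples = samples_per_variation * num_variations
    return ceil(total_samples / daily_visitors)

# Example: 5,000 samples per variation, 1,000 daily visitors,
# one control plus one variation.
print(estimated_duration_days(5000, 1000))  # 10 days
```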

**False positive rate (FPR)**: False Positive Rate defines the cutoff at which decisions are declared. If the FPR is set to 5%, decisions are declared when a variation’s corresponding probability reaches 100% − FPR/2 = 97.5%. Note that VWO campaigns are two-sided (there are two possible recommendations: disable and winner), so the FPR is split evenly between the two recommendations.

Statistically, FPR is a measure of accuracy: the likelihood of declaring a decision when there is no actual improvement in the test. Lower FPR values require more samples. By default, FPR is set to 10%, as per convention.
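
The two-sided split described above is simple arithmetic; a short sketch of the probability cutoff for each recommendation:

```python
def decision_threshold(fpr_percent):
    """Probability cutoff (in %) for each of the two recommendations
    ('winner' / 'disable') when the FPR budget is split two ways."""
    return 100 - fpr_percent / 2

print(decision_threshold(5))   # 97.5
print(decision_threshold(10))  # 95.0
```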

**Statistical Power**: A high-powered experiment gives a stronger guarantee that a true effect will not be missed, but demands a larger sample size. A statistical power of 90% means that if there is a true uplift greater than the MDE in the primary metric, there is a 90% chance it will be detected. By convention, statistical power is kept at 80%.
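
The cost of extra power can be sketched from the classical normal-approximation formula, where sample size scales with (z-significance + z-power)². This is a textbook approximation used for illustration, not VWO’s internal model:

```python
from statistics import NormalDist

def sample_size_factor(alpha, power):
    """Relative sample-size factor (z_{1-alpha/2} + z_{power})^2 from
    the classical normal-approximation formula (illustrative only)."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) ** 2

# How much larger a sample does raising power from 80% to 90%
# demand, at an FPR of 10%?
print(sample_size_factor(0.10, 0.90) / sample_size_factor(0.10, 0.80))
```

Under these assumptions, moving from 80% to 90% power inflates the required sample size by roughly 40%.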

**Playing Around With Hyperparameters**

Statistics is an integral part of a successful experimentation program, and understanding these tradeoffs can help you derive more value from your experiments. In the early stages of an experimentation journey, traffic is relatively low, but opportunities for big uplifts on your core metrics are greater. New websites are generally unoptimized and offer low-hanging fruit that can be used to make drastic improvements to the user experience. At this stage, prefer experiments with a higher MDE that require fewer visitors.

Gradually, as you optimize your website, large uplifts become rarer and drastic improvements are hard to come by. However, you now have the advantage of higher traffic volume, which lets you test smaller, more nuanced differences. Hence, in the later stages, run high-volume experiments with a lower MDE to detect those nuanced differences.
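
The early-stage versus late-stage strategy can be sketched by combining the earlier pieces into one rough duration estimate. All numbers (revenue standard deviation, traffic levels, MDE values) are illustrative assumptions, and the formula is the classical normal approximation rather than VWO’s engine:

```python
from math import ceil
from statistics import NormalDist

def days_to_detect(mde_abs, sigma, daily_visitors, alpha=0.10, power=0.80):
    """Days needed for a two-variation test to detect an absolute uplift
    of `mde_abs`, under the classical normal approximation (illustrative)."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    n_per_variation = 2 * z ** 2 * sigma ** 2 / mde_abs ** 2
    return ceil(2 * n_per_variation / daily_visitors)

# Early stage: low traffic, but a large MDE keeps the test short.
print(days_to_detect(mde_abs=1.00, sigma=5.0, daily_visitors=500))
# Later stage: far more traffic makes a small MDE feasible.
print(days_to_detect(mde_abs=0.25, sigma=5.0, daily_visitors=10000))
```

Chasing a small MDE on early-stage traffic is what blows up test duration; with the same low traffic, shrinking the MDE from $1.00 to $0.25 multiplies the required days many times over.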

We at VWO encourage you to understand and play around with these tradeoffs to build greater statistical awareness into your experimentation program.