APPLICABILITY: VWO Testing, except Multivariate testing.
Statistics are an indispensable part of A/B testing, and VWO allows you to configure the crucial statistical parameters. The Statistical Configuration feature at the top-right corner of your campaign report lets you set the statistical parameters that drive your campaign towards your preferred analysis.
It combines two configuration sections - one for the entire campaign and the other for individual metrics. Click the Edit icon next to the required section to make the necessary changes. Editing the campaign-specific configurations takes you to the SmartStats Configuration section under the Configuration tab, while editing the metric-specific configurations takes you to the Duration Calculator page.
NOTE: The changes applied to the metric on the Duration Calculator page will be applied only at the campaign level and will not be reflected in the metric’s inherent configurations, as saved in Data360 > Metrics.
Statistical Configuration for Campaign
These configurations are collectively applied to all metrics in the campaign and refer to statistical corrections in final probability values.
- Sequential Testing Correction: Statistical tests are designed to be run only once, after the entire sample has been collected in the campaign. However, modern A/B testing requires detecting winning variations as early as possible to save precious visitors in a campaign.
Sequential Testing refers to the practice of continuously monitoring probability values so that a variation can be deployed (or disabled) as soon as significance is achieved. However, sequential testing leads to an increase in statistical errors, specifically the chance of declaring a false winner (also called the false positive rate).
Sequential Testing Correction therefore adjusts your probabilities based on the maximum sample size of the campaign, making them more accurate and ensuring that the false positive rate is maintained (see the simulation sketch after this list).
- Bonferroni Correction: A typical A/B test compares a single variation against a control group. However, when you test multiple variations in a campaign, the chance of randomly declaring a winner increases with each additional variation.
For instance, imagine running an A/A test (all variations identical to the control) with 20 variations and a 5% false positive rate (mistakenly identifying a winner by chance). There's a high chance that at least one variation will be falsely flagged as a winner due to chance alone.
The Bonferroni correction addresses this issue by considering the number of variations within a campaign. It adjusts the statistical threshold (typically the 95% confidence level) to account for the increased risk of false positives. This helps ensure that any observed improvements are less likely to be due to chance.
While the traditional Bonferroni correction simply shifts the decision boundary, VWO takes a more sophisticated approach. It incorporates the uncertainty directly into the probability values themselves. This allows for a more nuanced analysis while still controlling the overall error rate across the variations of the given campaign (the family-wise error rate). The classic calculation is illustrated after this list.
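To see why repeated peeking needs a correction, here is a minimal simulation sketch. It assumes a simple two-proportion z-test checked at several interim looks during an A/A test; it is illustrative only and is not VWO's SmartStats implementation, which applies the correction to Bayesian probability values instead.

```python
import numpy as np
from scipy.stats import norm

def peeking_false_positive_rate(n_sims=2000, n_max=10_000, looks=20,
                                base_rate=0.10, alpha=0.05, seed=0):
    """Fraction of A/A simulations in which at least one interim look
    crosses the nominal 95% significance threshold."""
    rng = np.random.default_rng(seed)
    z_crit = norm.ppf(1 - alpha / 2)
    checkpoints = np.linspace(n_max // looks, n_max, looks, dtype=int)
    false_positives = 0
    for _ in range(n_sims):
        control = (rng.random(n_max) < base_rate).cumsum()    # running conversion counts
        variation = (rng.random(n_max) < base_rate).cumsum()  # identical "variation"
        for n in checkpoints:
            p_c, p_v = control[n - 1] / n, variation[n - 1] / n
            pooled = (control[n - 1] + variation[n - 1]) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(p_v - p_c) / se > z_crit:
                false_positives += 1  # a false winner/loser was declared at some look
                break
    return false_positives / n_sims

# Typically prints a value well above the nominal 0.05 - the inflation that
# Sequential Testing Correction is designed to remove.
print(peeking_false_positive_rate())
```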
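And here is the textbook Bonferroni arithmetic behind the A/A example above - a sketch of the classic correction only, not of VWO's approach, which folds the adjustment into the probability values rather than merely shifting the threshold.

```python
# Family-wise error rate when comparing k variations against the control,
# each tested at a per-comparison false positive rate alpha (independence assumed).
alpha, k = 0.05, 20

fwer_uncorrected = 1 - (1 - alpha) ** k           # ~0.64: a false "winner" is very likely
alpha_bonferroni = alpha / k                      # textbook Bonferroni: tighten each comparison
fwer_corrected = 1 - (1 - alpha_bonferroni) ** k  # ~0.049: back near the intended 5%

print(fwer_uncorrected, alpha_bonferroni, fwer_corrected)
```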
Statistical Configuration for Metrics
Each metric in VWO has configurable hyperparameters, which are settings that control how and when decisions will be made about that metric. You can set these hyperparameters independently for each metric in your campaign.
- ROPE (Region of Practical Equivalence): ROPE is defined as a range of uplifts that you consider negligible in a test. It is expressed as a percentage of the baseline average.
For example, if your base conversion rate is 10% and your ROPE region is from -1% to +1%, it means that any variation with a conversion rate in the range of 9.9% to 10.1% can be considered equivalent to the control.
ROPE offers several benefits. It allows for the early disabling of non-performing variations in the test. Additionally, ROPE enables the testing of non-inferiority, where the goal is to determine whether a variation is equivalent to or better than the control. For a detailed explanation of ROPE, click here. A small sketch after this list shows how the ROPE bounds follow from the baseline conversion rate.
- MDE (Minimum Detectable Effect): MDE is slightly different from ROPE and is also expressed as a percentage of the baseline average. MDE determines the statistical resolution of the campaign.
Imagine that the conversion rates of different variations in your test campaign are marked on a wall which you are looking at from a distance. Further, imagine that your vision is blurry, and you need glasses to differentiate between markings clearly. MDE defines the resolution with which you can clearly see two markings separately.
For instance, if the conversion rate of the baseline is 10% and the conversion rate of a variation is 11%, you will need high-power glasses to see the lines clearly or else the lines will seem to overlap from a distance. However, if the conversion rates are 10% and 15%, respectively, you will be able to see the difference even with low-powered glasses.
The power to differentiate between small differences comes from having more samples in an A/B test. Hence, MDE is the dominant factor in determining how many samples to invest in your campaign. As a rule of thumb, the finer the differences you want to detect in a metric, the larger the sample size you need. Also, you will be able to detect differences larger than the MDE in the campaign, but not smaller ones (see the sample-size sketch after this list).
- False Positive Rate (FPR): Statistical testing has two types of error rates, and FPR is one of them. It’s also called the Type-1 error rate in A/B testing and is defined as the chance that a decision is declared (either to disable or to deploy) in a campaign when there is no practical difference between the control and the variation. This directly determines the cutoff point at which the winner will be declared.
For example, if you set the false positive rate at 5%, a decision will be made in the campaign when the probability of improvement reaches (100 - FPR/2)% = 97.5% (see the threshold sketch after this list).
NOTE: The False Positive Rate is distributed into two halves for disable and winner recommendations.
- Statistical Power: Statistical Power is closely related to the second type of error rate in statistical testing, Type-2 error. Statistical Power is defined as the chance of detecting the winner in the campaign if the underlying effect is actually greater than the MDE.
In other words, if your MDE is 5% and the statistical power is 80%, it means that if the true uplift in the campaign is more than 5%, there is an 80% chance that it will be detected in the given number of samples.
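Here is the ROPE sketch referenced above - a minimal calculation, assuming ROPE is expressed as a symmetric percentage of the baseline average; the helper name is hypothetical and not part of VWO.

```python
# Hypothetical helper: convert a relative ROPE (% of the baseline average)
# into absolute conversion-rate bounds.
def rope_bounds(baseline_rate, rope_pct):
    lower = baseline_rate * (1 - rope_pct / 100)
    upper = baseline_rate * (1 + rope_pct / 100)
    return lower, upper

# A 10% baseline with a -1% to +1% ROPE gives 9.9% to 10.1%, as in the example above.
print(rope_bounds(0.10, 1))   # (0.099, 0.101)
```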
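The threshold sketch referenced in the FPR item: the winner threshold follows the (100 - FPR/2) rule stated above, while the symmetric disable threshold shown here is an assumption based on the note that the FPR is split into two halves.

```python
# Decision thresholds implied by a given false positive rate (in percent).
fpr = 5.0
winner_threshold = 100 - fpr / 2   # 97.5% probability of improvement -> deploy
disable_threshold = fpr / 2        # 2.5% probability of improvement -> disable (assumed symmetric split)
print(winner_threshold, disable_threshold)
```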
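The sample-size sketch referenced in the MDE item: a rough, classical two-proportion approximation, assuming a relative MDE on a conversion-rate metric, to show how a finer MDE inflates the required sample. VWO's Bayesian SmartStats engine and its Duration Calculator compute this differently; the numbers here are for intuition only.

```python
from math import ceil
from scipy.stats import norm

def samples_per_variation(baseline_rate, mde_rel, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect a relative uplift
    of `mde_rel` on `baseline_rate` with a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_rel)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z ** 2 * variance / (p1 - p2) ** 2)

print(samples_per_variation(0.10, 0.05))    # 5% MDE on a 10% baseline: roughly 58,000 per variation
print(samples_per_variation(0.10, 0.025))   # halving the MDE roughly quadruples the sample
```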
The above-stated hyperparameters collectively determine the sample size of the experiment. Note that the experiment duration is defined via the parameters of the primary metric. However, different metrics might require different sample sizes to detect the desired effect. It is entirely possible that the campaign is sufficiently powered to test a 5% effect size in one metric but underpowered to detect a 5% effect size in another metric.
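To make that last point concrete, here is a rough sketch, using the same classical two-proportion approximation as above, of how the same number of visitors can give adequate power on one metric but not on another with a lower baseline rate; again, this is illustrative and not how SmartStats computes power.

```python
from scipy.stats import norm

def achieved_power(baseline_rate, mde_rel, n_per_variation, alpha=0.05):
    """Approximate power of a two-proportion z-test at a fixed sample size."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_rel)
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation) ** 0.5
    return norm.cdf(abs(p2 - p1) / se - norm.ppf(1 - alpha / 2))

# Same campaign size (60,000 visitors per variation) and the same 5% relative MDE:
print(achieved_power(0.10, 0.05, 60_000))   # ~0.8: adequately powered for a 10% baseline metric
print(achieved_power(0.02, 0.05, 60_000))   # ~0.2: underpowered for a 2% baseline metric
```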
ATTENTION: With VWO reports, we have pre-configured the best defaults for your campaigns so that you generally don’t need to adjust the values of these parameters. If you are an advanced experimenter who fully understands your statistical requirements, you can customize these hyperparameters according to your needs.