APPLICABILITY: Available on all plans for Web A/B Testing, Web Split Testing, Mobile App A/B Testing, and Feature Management and Experimentation.
Overview
When conducting a simple A/B test, you might find that the conversion rate for your control group is 10%, while the variation group shows a conversion rate of 15%. At first glance, it might seem obvious that the variation wins. Why, then, are statistics needed in experimentation?
However, the answer would be far less obvious if the control converted at 10% and the variation at 10.2%. With a gap that small, it is hard to tell whether the difference reflects the variation's true impact or is simply due to chance, so you need a way to determine whether the difference is reliable. As it turns out, most experiments produce only a small uplift, which raises the question of whether that uplift is significant enough to be considered real.
Here's where the concept of statistical significance comes into play. Statistical significance helps us determine whether the observed improvement in a metric is genuinely due to an actual difference or just a result of random chance. Understanding this principle is crucial to making informed decisions while A/B testing.
The Five Levels of Statistical Inference
The statistical machinery of significance breaks down elegantly into five levels of information. The following hierarchy exhaustively explains the structure of the stats engine.
Level 0: Empirical Data
The base level is the actual data collected from your A/B test. It's the raw numbers showing how each group performed regarding recorded metrics like revenue, conversions, and visitors. This data is the ground truth of all further analysis and does not have any uncertainty. The first column in the table shows this data.
For binary metrics, such as conversion rate and add-to-cart rate, this data is labeled Unique Conversions/Visitors. For real-valued metrics, such as revenue per visitor, it is labeled Total (Conversions/Visitors).
Level 1: Expected Average
This is the first level of statistics. We use the empirical data to form statistical projections about the likely range of averages for each variation. These projections are not fixed numbers but distributions showing all possible averages and their corresponding probabilities. At VWO, we model these summarized metrics as normal distributions. The second column in the table shows these statistics.
For binary metrics, such as conversion rate and add-to-cart rate, this column is labeled Expected Conversion Rate. For real-valued metrics, such as revenue per visitor, it is labeled Expected Value Per Visitor.
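To make the idea of an expected average concrete, here is a minimal sketch for a binary metric. It assumes the textbook normal approximation, with the observed rate as the mean and the binomial standard error as the spread. The function name and the sample counts are hypothetical, and VWO's actual stats engine may fit its distributions differently.

```python
from math import sqrt
from statistics import NormalDist


def expected_rate(conversions: int, visitors: int) -> NormalDist:
    """Normal approximation to the uncertainty in a conversion rate.

    Assumption: mean = observed rate, sigma = binomial standard error.
    """
    rate = conversions / visitors
    std_error = sqrt(rate * (1 - rate) / visitors)
    return NormalDist(mu=rate, sigma=std_error)


# Hypothetical Level 0 data: 100 unique conversions from 1,000 visitors.
control = expected_rate(conversions=100, visitors=1000)

# Central 95% range of plausible average conversion rates.
low = control.inv_cdf(0.025)
high = control.inv_cdf(0.975)
print(f"expected rate {control.mean:.3f}, 95% range {low:.3f} to {high:.3f}")
```

The more visitors you collect, the smaller the standard error becomes, so the range of plausible averages narrows as the test runs.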
Level 2: Expected Improvement
Next, we calculate the difference between the average distributions of the variation and the baseline. This difference, or Improvement Distribution, helps us understand how much better or worse the variation is performing compared to the baseline.
It is shown in the third column, Expected Improvement. In the detailed view, the table also shows a box plot of the improvement posterior.
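The subtraction of two distributions can be sketched in a few lines. This example assumes both averages are independent normal distributions and measures improvement as an absolute difference in conversion rate. The helper name and the counts are hypothetical, and the report may express improvement differently (for example, as a relative change).

```python
from math import sqrt
from statistics import NormalDist


def rate_distribution(conversions: int, visitors: int) -> NormalDist:
    """Normal approximation to a conversion rate (an assumed model)."""
    rate = conversions / visitors
    return NormalDist(rate, sqrt(rate * (1 - rate) / visitors))


# Hypothetical data: baseline converts at 10%, variation at 15%.
baseline = rate_distribution(100, 1000)
variation = rate_distribution(150, 1000)

# The difference of two independent normals is also normal: the means
# subtract and the variances add. NormalDist implements this with "-".
improvement = variation - baseline

print(f"mean improvement: {improvement.mean:.4f}")
print(f"spread (std dev): {improvement.stdev:.4f}")
```

Note that the spread of the improvement is wider than the spread of either average alone, because the uncertainty of both groups carries into the difference.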
Level 3: Probability of Improvement
At this level, we use the Improvement Distribution to infer the chance that the improvement lies above the ROPE (Region of Practical Equivalence), the range of differences around zero that are too small to matter in practice. Essentially, this represents the probability of the variation being better than the baseline by a meaningful margin. This is the probabilistic measure of statistical significance. The higher this probability, the more confident we can be in the variation's performance. The last column shows this final measure of significance.
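As a rough illustration, the probability of improvement is the share of the Improvement Distribution that lies above the upper edge of the ROPE. The sketch below assumes a normal Improvement Distribution. The function name, the distribution parameters, and the ROPE boundary of 0.005 (half a percentage point) are all hypothetical values chosen for the example.

```python
from statistics import NormalDist


def probability_of_improvement(improvement: NormalDist,
                               rope_upper: float = 0.0) -> float:
    """Probability mass of the improvement distribution above the ROPE."""
    return 1.0 - improvement.cdf(rope_upper)


# Illustrative improvement distribution: mean uplift of 5 percentage
# points with a standard deviation of about 1.5 percentage points.
improvement = NormalDist(mu=0.05, sigma=0.0147)

# Hypothetical ROPE: uplifts below 0.5 percentage points are treated
# as practically equivalent to no change.
p = probability_of_improvement(improvement, rope_upper=0.005)
print(f"probability of improvement: {p:.2%}")
```

With a ROPE boundary of zero, the same function gives the plain probability that the variation beats the baseline by any amount.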
Level 4: Decisions
Finally, decisions are made by applying a desired threshold to the probability of improvement. If this probability exceeds a certain threshold, typically 95%, we can conclude that the variation is statistically better than the baseline. Conversely, if it falls below a certain threshold, say 5%, we conclude it's not better. This decision-making process ensures that we don't deploy variations based on random fluctuations in the data. The winner threshold is shown as a dotted line in the table, and a decision is declared once the probability crosses it.
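The decision rule itself is a simple comparison against two thresholds. The sketch below uses the 95% and 5% values mentioned above as defaults. The function name and the returned labels are hypothetical and are not the wording used in VWO reports.

```python
def decide(prob_improvement: float,
           winner_threshold: float = 0.95,
           loser_threshold: float = 0.05) -> str:
    """Map a probability of improvement to a decision label."""
    if prob_improvement >= winner_threshold:
        return "variation is better"
    if prob_improvement <= loser_threshold:
        return "variation is not better"
    # Between the thresholds, the data cannot yet support a decision.
    return "keep collecting data"


for prob in (0.99, 0.60, 0.02):
    print(f"{prob:.0%}: {decide(prob)}")
```

Anything between the two thresholds stays undecided, which is what prevents a variation from being deployed on the strength of a random fluctuation.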
Conclusion
This five-level structure forms the backbone of the statistical inference engine used in VWO. It ensures that every decision made is backed by statistical significance, minimizing the risk of making false decisions. At VWO, the first four levels (Levels 0 to 3) are detailed in reports, while the fifth (Level 4: Decisions) provides clear recommendations on whether to adopt or disable the variation permanently. This structured approach helps you make more accurate and reliable decisions in your metric optimization efforts.