Overview
Imagine you're running an online store and want to test two new product page designs: one with a classic layout and another with a modern look. You want to see which design leads to more customers adding items to their carts (conversions). The numbers on the results page can be confusing! This is where VWO’s “in-depth data review” feature helps you make sense of your test results.
When you switch ON in-depth data review, it highlights important information in your test results. Here’s what you’ll see:
Graphical Representation of Expected Improvement
The Expected Improvement shows a range of how much better (or worse) each design might be doing compared to the original design (baseline). The graph highlights several key areas:
- Blue Zone (ROPE): This is the Region of Practical Equivalence. If a design's range touches this zone, the difference in conversions might not be significant yet. Variations in this zone are indicated in blue.
- Red Zone: If a design's range is entirely to the left (negative side) of the lower ROPE value (-ROPE%), it likely leads to fewer conversions than the baseline. These variations are marked red, and VWO recommends disabling them.
- Green Zone: If a design's range is entirely to the right (positive side) of the upper ROPE value (+ROPE%), it likely leads to more conversions than the baseline. These variations are marked green.
A variation that falls entirely within the blue or red region, or crosses from blue into red, will be disabled. VWO will then mark this variation as “Not Better than Baseline.”
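The zone logic described above can be sketched in a few lines of Python. This is an illustrative approximation only, not VWO's actual implementation; `classify_variation` and its interval inputs are hypothetical names, and the ROPE is assumed to be symmetric around zero:

```python
def classify_variation(ci_low, ci_high, rope=0.01):
    # Classify a variation's expected-improvement interval against
    # a symmetric Region of Practical Equivalence of +/- rope
    # (e.g. rope=0.01 means +/-1% relative improvement).
    if ci_low > rope:
        return "green"   # entire interval right of +ROPE: likely better
    if ci_high < -rope:
        return "red"     # entire interval left of -ROPE: likely worse
    return "blue"        # interval touches the ROPE: practically equivalent
```

For example, an interval of (+2%, +5%) would land in the green zone, while one of (-0.5%, +2%) still touches the ROPE and stays blue.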
When statistical corrections like Sequential Testing or Bonferroni are applied, VWO automatically factors those corrections into the expected improvement interval for each variation. Hovering over a variation range bar will display its expected improvement range.
Hovering over the expected improvement interval shows two values:
- Expected Improvement: The median of the statistically calculated and sequentially corrected expected improvement.
- Expected Improvement Interval: The interval representing the uncertainty in the expected improvement.
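As a rough sketch of how these two values can be read off a set of posterior samples of the improvement, here is a percentile-based summary. The function name and the simple percentile interval are illustrative assumptions, not VWO's exact method:

```python
import statistics

def improvement_summary(samples, level=0.95):
    # Summarize posterior samples of relative improvement:
    # returns (median, (low, high)) where (low, high) is a
    # central credible interval at the given level.
    s = sorted(samples)
    n = len(s)
    median = statistics.median(s)
    tail = (1 - level) / 2
    low = s[int(tail * (n - 1))]
    high = s[int((1 - tail) * (n - 1))]
    return median, (low, high)
```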
Decision Probabilities
Simple-view reports show the relevant decision-making probability for the test. By default, this value is the probability of better, which should reach a threshold value for a variation to be declared a winner.
The in-depth view shows three probabilities corresponding to Better (green zone), Equivalent (blue zone), and Worse (red zone), giving a clearer picture of the variation’s performance.
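These three probabilities can be estimated from posterior samples of the improvement by counting how many fall above, inside, or below the ROPE band. A minimal sketch with hypothetical names, not VWO's implementation:

```python
def decision_probabilities(samples, rope=0.01):
    # Estimate P(better), P(equivalent), P(worse) from posterior
    # samples of relative improvement, using a +/-rope equivalence band.
    n = len(samples)
    better = sum(x > rope for x in samples) / n
    worse = sum(x < -rope for x in samples) / n
    equivalent = 1 - better - worse
    return {"better": better, "equivalent": equivalent, "worse": worse}
```

By construction the three probabilities sum to 1, which is why viewing all three gives a fuller picture than the single probability of better.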
Detailed Performance Probabilities
The Probability of Better column will display the following statistical settings:
Minimum Detectable Effect (MDE)
- What It Is: The smallest difference in conversions you care about detecting.
- Impact: A smaller MDE might mean a longer test but ensures even small improvements can be detected.
- Example: If you set the MDE to ±5%, you care about detecting changes of at least 5% of the baseline average.
Note: For guardrail metrics, you are guarding against a maximum reduction in the test metric rather than looking for an improvement; to convey this, MDE is renamed MDR (Minimum Detectable Reduction).
Region of Practical Equivalence (ROPE)
- What It Is: The range around your baseline metric within which changes are considered practically insignificant.
- Impact: Helps identify non-performing variations early.
- Example: If the ROPE is set to ±1%, conversion rates between 9.9% and 10.1% are considered practically the same as the 10% baseline.
Statistical Power
- What It Is: The chance of correctly identifying the campaign winner when the actual effect is at least the MDE.
- Impact: Higher power means more reliable results but requires a larger sample size.
- Example: With 80% statistical power, your test has an 80% chance of detecting a real difference if there is one.
False Positive Rate (FPR)
- What It Is: The chance of mistakenly declaring a winner when there's no real difference.
- Impact: Lower FPR reduces the chance of false positives but requires a larger sample size.
- Example: An FPR of 5% means there's a 5% chance of incorrectly declaring a winner.
The Trade-off
These factors work together to determine how many visitors (samples) your test needs. More visitors make the results more reliable, but they also mean keeping your test running for a longer time.
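To see how these factors interact, here is a back-of-the-envelope sample-size estimate using the standard normal approximation for a two-sided, two-proportion test. This is a textbook formula for illustration only; VWO's engine uses its own sequential Bayesian calculations, and `sample_size_per_arm` is a hypothetical name:

```python
from statistics import NormalDist

def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.80):
    # Approximate visitors needed per variation to detect a relative
    # lift of `mde` over a `baseline` conversion rate, at false
    # positive rate `alpha` and the given statistical power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    delta = baseline * mde                # absolute lift for a relative MDE
    var = 2 * baseline * (1 - baseline)   # pooled variance of the difference
    return int((z_alpha + z_beta) ** 2 * var / delta ** 2) + 1
```

Playing with the inputs shows the trade-off directly: halving the MDE roughly quadruples the required visitors, and raising power from 80% to 90% also increases the sample size.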
Recommendations
There are pre-configured defaults for these factors that work well for most tests. Adjust them if you have specific needs, but sticking with the defaults is a safe bet for beginners.
The Bottom Line
In-depth data review helps you see your A/B test results clearly, making it easier to understand how each variation is performing. Don't worry about memorizing all the technical details—VWO handles those behind the scenes! Just focus on designing your tests and using the results to pick the winning variation.