What is Statistical Power?
Have you ever run an A/B test and ended up unsure whether the observed results were a fluke or a genuine improvement? Statistical power helps address this dilemma. It's the probability of your test correctly identifying a winner when there's a true uplift (improvement) larger than the Minimum Detectable Effect (MDE).
Imagine conducting 100 A/B tests in which a true improvement exists in the target metric (e.g., conversion rate). Statistical power tells you how many of those tests will detect a statistically significant winner: at 80% power, roughly 80 of the 100 would.
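The thought experiment above can be sketched as a quick Monte Carlo simulation. The sketch below assumes a one-sided two-proportion z-test at a 5% significance level; the conversion rates, sample size, and function name are illustrative, not VWO's actual statistics engine.

```python
import random
from statistics import NormalDist

def detects_winner(p_control, p_variant, n, alpha=0.05):
    """Simulate one A/B test and check whether it declares a winner
    (one-sided two-proportion z-test for an uplift)."""
    conv_c = sum(random.random() < p_control for _ in range(n))
    conv_v = sum(random.random() < p_variant for _ in range(n))
    pc, pv = conv_c / n, conv_v / n
    pooled = (conv_c + conv_v) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    if se == 0:
        return False
    return (pv - pc) / se > NormalDist().inv_cdf(1 - alpha)

random.seed(42)
trials = 400
# True uplift exists: 10% -> 12% conversion, 4,000 visitors per arm.
wins = sum(detects_winner(0.10, 0.12, n=4000) for _ in range(trials))
print(f"Detected a winner in {wins}/{trials} tests (~{wins / trials:.0%} power)")
```

The fraction of simulated tests that declare a winner is exactly the empirical power: repeat the experiment with a smaller true uplift or fewer visitors and the fraction drops.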
Analogy: Statistical Power as a Thermometer
Think of statistical power like a thermometer. A low-powered thermometer can easily detect high fevers but struggles with slight temperature changes. Similarly, a low-powered test might miss subtle uplifts in your campaign. Conversely, a high-powered test acts like a sensitive thermometer, effectively detecting both large and small improvements.
Impact of Statistical Power
Increasing statistical power increases the number of visitors your campaign requires. Together with the MDE, power determines the maximum sample size needed for the test. The power you set corresponds to detecting an uplift exactly the size of the MDE; uplifts larger than the MDE are even more likely to be detected.
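The power-to-visitors relationship can be made concrete with the standard normal-approximation sample-size formula for comparing two proportions. This is a rough sketch under that textbook approximation; the function name and defaults are hypothetical and not VWO's actual sample-size calculator.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline, mde_rel, power=0.80, alpha=0.05):
    """Approximate visitors per arm for a two-sided two-proportion z-test.

    baseline: control conversion rate (e.g., 0.10 for 10%)
    mde_rel:  relative MDE (e.g., 0.05 for a +5% relative uplift)
    """
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Raising power raises the visitor requirement for the same MDE:
for power in (0.80, 0.90, 0.95):
    print(f"{power:.0%} power -> {sample_size_per_arm(0.10, 0.05, power=power):,} visitors per arm")
```

Running this shows the trade-off directly: each step up in power demands a meaningfully larger sample for the same baseline and MDE.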
Convention: The 80% Rule and Beyond
By convention, statistical power is typically set to 80%. Because the MDE is usually a conservative estimate of the expected improvement, an 80% power setting translates in practice to a higher chance of detecting larger uplifts. VWO uses 80% power by default for all metrics, and users can adjust it for crucial campaigns.
Best Practices for Statistical Power
Here are some key considerations for setting statistical power:
- Conservative MDE, Moderate Power: If your MDE is a conservative estimate (lower end of expected uplifts), a power of 80% is sufficient. This ensures a high chance of detecting uplifts larger than the MDE.
- High Power for Guardrails: Guardrail metrics (e.g., revenue) automatically shut down campaigns if negatively impacted. VWO recommends a high power (95-99%) for such metrics to reliably detect negative effects.
- High Power for Critical Tests: For business-critical experiments where certainty is crucial, you can increase the power at the expense of requiring more visitors.
- Lower Power, Not Ideal: Generally, avoid power settings below 80%. This can lead to inconclusive experiments. If limited by sample size, focus on using more sensitive metrics and a higher MDE.
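The last two points can be illustrated by inverting the usual calculation: given a fixed visitor budget, what power do you actually achieve? The sketch below uses the same two-proportion normal approximation; the budget, rates, and function name are made-up illustrations.

```python
import math
from statistics import NormalDist

def achieved_power(n_per_arm, baseline, mde_rel, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test
    at a fixed per-arm sample size."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    se = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)

# Fixed budget of 20,000 visitors per arm at a 10% baseline:
# a 5% relative MDE is badly underpowered, a 10% MDE is well-powered.
print(f"MDE +5%:  {achieved_power(20_000, 0.10, 0.05):.0%} power")
print(f"MDE +10%: {achieved_power(20_000, 0.10, 0.10):.0%} power")
```

This is the quantitative version of the advice above: when the visitor budget is fixed, raising the MDE (or choosing a more sensitive metric) is what restores power, not lowering the power setting.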
By understanding and applying statistical power effectively, you can design tests that deliver reliable results. Remember, striking the right balance between power and visitor requirements is key to uncovering genuine uplifts that can optimize your campaigns.