A/B Testing Guide: How to Run Experiments That Drive Growth
A/B testing removes the guesswork from optimization. Instead of debating whether a green button converts better than a blue one, you test both and let data decide. The companies that grow fastest share a common trait: they test constantly.
This guide covers everything you need to run valid A/B tests, from forming hypotheses to analyzing results without falling into common statistical traps.
What Is A/B Testing?
A/B testing compares two versions of something to see which performs better. You split your traffic randomly between version A (the control) and version B (the variant), then measure which produces more of your desired outcome.
What you can test:
- Headlines and copy
- Button colors, text, and placement
- Page layouts and designs
- Pricing displays
- Email subject lines
- Checkout flows
- Product features
- Onboarding sequences
The power of A/B testing lies in compounding small improvements. A 5% lift here, 8% there, 12% on another test. After a year of consistent testing, these add up to transformational growth.
Why A/B Testing Matters
Without testing, you’re making decisions based on opinions, intuition, or what worked for someone else. These approaches fail more often than they succeed because your context, audience, and product are unique.
The case for data-driven decisions:
- Opinions don’t convert visitors. Data reveals what actually works.
- Incremental improvements compound. Ten 5% improvements multiply out to a 63% total lift (see the quick calculation after this list).
- Testing reduces risk. Validate changes before full rollout.
- Builds organizational learning. Every test teaches something.
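The 63% figure above is plain compounding arithmetic; a one-line sketch:

```python
# Ten successive 5% lifts multiply rather than add:
total_lift = 1.05 ** 10 - 1
print(f"{total_lift:.1%}")  # 62.9%, i.e. roughly a 63% total lift
```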
The alternative is HiPPO decision-making: the Highest Paid Person’s Opinion wins. This is how companies ship features nobody wants and redesigns that tank conversions.
A/B Testing Terminology
Before diving deeper, learn the language:
| Term | Definition |
|---|---|
| Control | Original version (A) |
| Variant | New version being tested (B) |
| Conversion | Desired action (signup, purchase, click) |
| Sample size | Number of visitors in test |
| Statistical significance | Confidence the result isn’t random chance |
| Confidence level | Typically 95% (5% chance of false positive) |
| Effect size | Magnitude of the difference between versions |
| MDE | Minimum Detectable Effect - smallest difference you can reliably measure |
Forming Strong Hypotheses
Good A/B tests start with good hypotheses. A hypothesis isn’t just “let’s try this.” It’s a structured prediction based on evidence.
Hypothesis Structure
Use this format: “If we [change X], then [metric Y] will [increase/decrease] because [reason Z].”
The “because” is critical. It forces you to articulate why you expect the change to work, which helps you learn regardless of the outcome.
Good hypothesis examples:
- “If we add customer testimonials near the signup button, then signups will increase because social proof reduces uncertainty for hesitant visitors.”
- “If we reduce checkout form fields from 6 to 3, then completion rate will increase because fewer fields mean less friction.”
- “If we change the CTA from ‘Submit’ to ‘Get My Free Report,’ then clicks will increase because value-focused language is more compelling than generic commands.”
Bad hypothesis examples:
- “Let’s try a green button” (no reasoning, no expected outcome)
- “This will definitely work” (not measurable, no learning opportunity)
- “Our competitor does this” (copying isn’t a hypothesis)
Where to Find Test Ideas
Not sure what to test? Look here:
- Analytics - Find high-traffic pages with low conversion rates. These offer the biggest opportunities.
- User feedback - Survey responses, support tickets, and interview transcripts reveal friction points.
- Heatmaps and recordings - Watch real users struggle with your interface.
- Exit surveys - Ask bouncing visitors why they’re leaving.
- Competitor analysis - Note what others do differently (but form your own hypothesis about whether it’ll work for you).
- Best practice research - Industry studies suggest patterns that often work.
Statistical Significance Explained
Statistical significance measures how unlikely your result would be if there were no real difference between versions. At a 95% confidence level (the industry standard), a difference this large would show up by chance less than 5% of the time if the variants actually performed identically.
What Significance Is and Isn’t
Significance tells you whether the difference is real. It doesn’t tell you whether the difference is meaningful for your business. A test might show a statistically significant 0.1% improvement, but that improvement might not matter commercially.
Conversely, a test might show a 15% improvement that’s not statistically significant because you didn’t run it long enough. The improvement might be real, but you can’t be confident yet.
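If you want to see where the significance number comes from, here is a minimal two-proportion z-test sketch in Python; the visitor and conversion counts are made up for illustration:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical results: control (A) vs variant (B)
visitors_a, conversions_a = 10_000, 200   # 2.0% conversion
visitors_b, conversions_b = 10_000, 250   # 2.5% conversion

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b

# Pooled two-proportion z-test: under the null, both groups share one rate
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se

# Two-sided p-value: chance of a gap this large if there were no real difference
p_value = 2 * norm.sf(abs(z))
print(f"lift: {(p_b - p_a) / p_a:.1%}, z = {z:.2f}, p = {p_value:.4f}")  # p ≈ 0.017
```

A p-value under 0.05 clears the 95% confidence bar described above; whether the corresponding lift is worth shipping is a separate, commercial question.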
Sample Size Calculation
Sample size determines how long you need to run your test. Four factors affect the required sample:
- Baseline conversion rate - Higher baselines need fewer visitors to detect the same relative change
- Minimum detectable effect (MDE) - Smaller effects require larger samples
- Statistical power - Typically 80% (chance of detecting a real effect)
- Confidence level - Typically 95% (a 5% significance level)
Sample size examples (per variation):
| Baseline Rate | MDE (Relative) | Sample Needed |
|---|---|---|
| 2% | 20% | ~15,000 |
| 5% | 20% | ~6,000 |
| 10% | 20% | ~3,000 |
| 2% | 50% | ~2,500 |
Use a sample size calculator rather than doing this math yourself. Evan Miller’s calculator (evanmiller.org/ab-testing/) is the standard.
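If you want to sanity-check what a calculator tells you, here is a rough sketch of the standard two-proportion sample size formula. It defaults to a one-sided test, which lands in the same ballpark as the figures in the table above; a two-sided test (many calculators’ default) needs noticeably more:

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_variation(baseline_rate, relative_mde,
                              alpha=0.05, power=0.80, two_sided=False):
    """Approximate visitors needed per variation for a two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)     # rate you want to be able to detect
    z_alpha = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    numerator = (z_alpha * sqrt(2 * p1 * (1 - p1))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

n = sample_size_per_variation(0.02, 0.20)       # 2% baseline, 20% relative MDE
print(n)                                        # roughly 15,600 per variation

# At 1,000 visitors per week split 50/50, each variation collects ~500 per week:
print(f"~{n / 500:.0f} weeks to finish the test")  # ~31 weeks
```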
Why Sample Size Matters
Running tests with insufficient sample size is the most common A/B testing mistake. Small samples produce noisy data that swings wildly. You might see variant B “winning” by 30% on Monday, then “losing” by 20% on Tuesday. Neither result is reliable.
The math is unforgiving. If your site gets 1,000 visitors per week with a 2% conversion rate, a 50/50 split gives each variation about 500 visitors per week. Detecting a 20% relative improvement therefore takes roughly 30 weeks for a single test.
This is why low-traffic sites often can’t do meaningful A/B testing. You need volume or you need to accept only detecting large effects.
Running Your Test
Pre-Test Checklist
Before launching:
- Hypothesis documented with expected outcome
- Sample size calculated and test duration estimated
- Tracking verified (conversions firing correctly)
- All variants QA tested on multiple devices
- No other tests running on the same page
- Stakeholders informed of timeline
Traffic Allocation
Split traffic 50/50 between control and variant. Equal splits maximize statistical power and minimize test duration.
Consider 90/10 splits only for very risky changes where you want to limit exposure. This significantly lengthens the test, because the smaller arm takes far longer to reach its required sample.
Ensure random assignment. Visitors should be randomly assigned to a variant, not based on any characteristic that might correlate with conversion likelihood.
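One common way to get assignment that is random with respect to visitor characteristics yet sticky for returning visitors is to hash a visitor ID together with the experiment name. A minimal sketch (the experiment name, visitor ID, and split are placeholders):

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str, variant_b_share: float = 0.5) -> str:
    """Deterministically bucket a visitor: the same inputs always give the same variant."""
    # Hashing the experiment name together with the visitor ID means different
    # experiments bucket the same visitor independently.
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # roughly uniform value in [0, 1]
    return "B" if bucket < variant_b_share else "A"

# 50/50 by default; pass variant_b_share=0.1 for a cautious 90/10 split
print(assign_variant("visitor-123", "checkout-form-test"))
```

Because the bucket depends only on the visitor ID and the experiment name, a returning visitor always sees the same variant, and assignment cannot correlate with anything you know about them.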
Test Duration
Run tests for complete weeks to account for day-of-week effects. Monday visitors often behave differently than weekend visitors. Stopping mid-week skews results.
Minimum duration guidelines:
- At least 1-2 complete weeks
- Until reaching calculated sample size
- Don’t stop because results “look good”
- Don’t extend because results aren’t significant yet
The last two points are critical. Stopping early when you see a “winner” dramatically inflates false positive rates. Extending tests until you see significance is equally problematic.
What Not to Do During a Test
Don’t peek and stop early. Looking at results repeatedly and stopping when you see significance isn’t valid testing. Each peek increases your false positive rate. Tools like Optimizely handle this with sequential testing, but simple calculators assume you look once.
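The inflation from peeking is easy to demonstrate with a simulation: run many A/A tests where both arms are identical, check a z-test every day, and count how often at least one peek looks “significant.” The traffic numbers below are arbitrary:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
true_rate, daily_visitors, days, n_tests = 0.05, 500, 14, 2000
false_positives = 0

for _ in range(n_tests):
    # A/A test: both arms share the same true rate, so any "significant"
    # result is by definition a false positive.
    conv_a = conv_b = n_a = n_b = 0
    significant_at_any_peek = False
    for _ in range(days):
        conv_a += rng.binomial(daily_visitors, true_rate)
        conv_b += rng.binomial(daily_visitors, true_rate)
        n_a += daily_visitors
        n_b += daily_visitors
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (conv_b / n_b - conv_a / n_a) / se
        if 2 * norm.sf(abs(z)) < 0.05:          # the daily peek
            significant_at_any_peek = True
    if significant_at_any_peek:
        false_positives += 1

# A single pre-planned look would be wrong about 5% of the time;
# peeking every day for two weeks is wrong far more often.
print(f"false positive rate with daily peeking: {false_positives / n_tests:.0%}")
```

The exact rate depends on traffic and how often you look, but with daily peeks it lands well above the nominal 5%.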
Don’t change the test mid-experiment. If you modify the variant or adjust the goal, you’ve started a new test. Previous data is invalid.
Don’t run multiple tests on the same page. Interaction effects make results uninterpretable. Test one thing at a time per page.
Analyzing Results
Reading Your Results
| Scenario | Interpretation | Action |
|---|---|---|
| Variant significant positive | Winner found | Implement variant |
| Variant significant negative | Variant is worse | Keep control |
| No significance reached | Inconclusive | Need more data or different test |
Focus on One Primary Metric
Every test needs one primary metric that determines the winner. Secondary metrics provide context but don’t drive the decision.
If you measure 10 metrics and cherry-pick the one that shows significance, you’re not testing—you’re hunting for false positives.
Segment Analysis
After determining overall results, segment analysis can reveal hidden patterns:
- Device - Mobile often behaves differently than desktop
- Traffic source - Paid vs organic visitors may respond differently
- New vs returning - First-time visitors and repeat visitors have different contexts
- Geography - Regional differences can be significant
A test might be flat overall but show a strong positive effect for mobile users offset by a negative effect for desktop. Understanding segments helps you make better decisions.
Caution: Don’t over-segment. With enough segments, you’ll find “significant” results by chance. Segment analysis generates hypotheses for future tests, not conclusions.
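If your test data lives in an event table, a segment breakdown is a few lines of pandas; the toy rows and column names below are assumptions about your schema:

```python
import pandas as pd

# Assumed schema: one row per visitor in the test
events = pd.DataFrame({
    "variant":   ["A", "A", "B", "B", "A", "B"],
    "device":    ["mobile", "desktop", "mobile", "desktop", "mobile", "mobile"],
    "converted": [0, 1, 1, 0, 0, 1],
})

# Conversion rate and sample size per variant within each segment
by_segment = (events
              .groupby(["device", "variant"])["converted"]
              .agg(conversion_rate="mean", visitors="count"))
print(by_segment)
```

Keep an eye on the visitors column: a segment with only a handful of visitors will swing dramatically without meaning anything.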
A/B Testing Examples
Example 1: Landing Page Headline
- Control: “Project Management Software”
- Variant: “Ship Projects 2x Faster”
- Result: +15% signups
- Why it worked: Benefit-focused language speaks to outcomes users care about. Features describe what it is; benefits describe what it does for you.
Example 2: Pricing Page Layout
- Control: Three pricing tiers in equal columns
- Variant: Middle tier highlighted as “Most Popular”
- Result: +8% plan selection
- Why it worked: Social proof and anchoring reduce choice paralysis. The highlight suggests “this is what most people choose,” making the decision easier.
Example 3: Checkout Flow
- Control: Multi-page checkout (4 steps)
- Variant: Single-page checkout
- Result: +12% completion
- Why it worked: Each page transition is a potential abandonment point. Consolidating reduces dropout opportunities.
Example 4: CTA Button Text
- Control: “Submit”
- Variant: “Get My Free Report”
- Result: +25% clicks
- Why it worked: “Submit” is generic and gives no indication of value. “Get My Free Report” specifies the benefit and uses first-person language that creates ownership.
Multivariate Testing
Multivariate testing (MVT) tests multiple elements simultaneously. Instead of testing headline A vs headline B, you test combinations: Headline A + Image 1, Headline A + Image 2, Headline B + Image 1, Headline B + Image 2.
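A 2x2 multivariate test is just the cross product of the element variants; a quick sketch using the headline-and-image example:

```python
from itertools import product

headlines = ["Headline A", "Headline B"]
images = ["Image 1", "Image 2"]

# Every combination becomes its own test arm
combinations = list(product(headlines, images))
for arm, (headline, image) in enumerate(combinations, start=1):
    print(f"Arm {arm}: {headline} + {image}")

print(f"{len(combinations)} arms to fill with traffic")
```

Each of those arms needs its own full sample, which is why the traffic requirement grows so quickly.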
When MVT makes sense:
- High traffic (you need sample size for each combination)
- Testing related elements that might interact
- Understanding which combinations work best
MVT limitations:
- Requires significantly more traffic (four combinations means four full-size samples instead of two)
- More complex setup and analysis
- Harder to interpret what drove results
For most teams, sequential A/B tests are more practical. Test headline first, implement winner, then test images. Less elegant but more achievable.
A/B Testing Tools
Free and Low-Cost Options
| Tool | Best For | Notes |
|---|---|---|
| PostHog | Product analytics + testing | Generous free tier, open source option |
| GrowthBook | Feature flags + experiments | Open source, free self-hosted |
| Google Optimize | Basic web testing | Discontinued in 2023, choose an alternative |
| Your email platform | Email subject lines | Most ESPs include A/B testing |
Paid Tools
| Tool | Best For | Starting Price |
|---|---|---|
| VWO | Mid-market companies | $199/month |
| Convert | Privacy-focused testing | $99/month |
| Optimizely | Enterprise | Custom pricing |
| AB Tasty | Enterprise with personalization | Custom pricing |
DIY Approaches
With engineering resources, you can build simple A/B testing:
- Feature flags - Tools like LaunchDarkly or simple config flags
- Redirect tests - Send traffic to different URLs, measure in analytics
- Server-side splits - Random assignment at the server level
DIY works for teams with technical capacity who want full control over their testing infrastructure.
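As one example of the redirect approach, here is a minimal Flask sketch that hashes a visitor identifier and sends half the traffic to a variant URL; the URLs, cookie name, and experiment name are placeholders:

```python
import hashlib
from flask import Flask, redirect, request

app = Flask(__name__)

CONTROL_URL = "https://example.com/landing-a"   # placeholder control page
VARIANT_URL = "https://example.com/landing-b"   # placeholder variant page

def bucket(visitor_id: str) -> float:
    """Map a visitor to a stable value in [0, 1] for this experiment."""
    digest = hashlib.sha256(f"landing-test:{visitor_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF

@app.route("/landing")
def landing():
    # Reuse an existing analytics/visitor cookie so assignment stays sticky;
    # fall back to the requester's IP for anonymous visitors.
    visitor_id = request.cookies.get("visitor_id", request.remote_addr or "unknown")
    destination = VARIANT_URL if bucket(visitor_id) < 0.5 else CONTROL_URL
    return redirect(destination, code=302)

if __name__ == "__main__":
    app.run()
```

Conversions for each URL are then measured in whatever analytics tool you already run.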
Building a Testing Culture
Getting Organizational Buy-In
Testing culture starts with visible wins. Run tests with high potential impact, document results thoroughly, and share learnings broadly. Success stories build support for continued investment.
Frame testing as risk reduction, not just optimization. “We tested this before rolling it out” sounds responsible. “We’re launching based on a hunch” doesn’t.
Testing Velocity
More tests equal more learnings. High-performing growth teams run 2-4 tests per month minimum.
Build a testing backlog prioritized by:
- Impact - How much could this move the metric?
- Confidence - How likely is the hypothesis to be correct?
- Ease - How hard is this to implement?
Score each dimension, multiply for priority ranking.
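A tiny sketch of the impact x confidence x ease scoring described above, with a made-up backlog:

```python
# Hypothetical backlog: each idea scored 1-10 on impact, confidence, and ease
backlog = [
    {"test": "Checkout: reduce form fields",   "impact": 8, "confidence": 7, "ease": 6},
    {"test": "Homepage: benefit-led headline", "impact": 7, "confidence": 6, "ease": 9},
    {"test": "Pricing: highlight middle tier", "impact": 6, "confidence": 5, "ease": 8},
]

for idea in backlog:
    idea["score"] = idea["impact"] * idea["confidence"] * idea["ease"]

# Highest score runs first
for idea in sorted(backlog, key=lambda i: i["score"], reverse=True):
    print(f"{idea['score']:>4}  {idea['test']}")
```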
Documentation
Document every test:
- Hypothesis (with reasoning)
- Variants (screenshots)
- Results (significance, effect size, segments)
- Learnings (what did we learn regardless of outcome)
- Next steps (what test follows from this)
Failed tests are as valuable as wins if you learn from them. A hypothesis that didn’t work tells you something about your users.
Common A/B Testing Mistakes
- Testing without a hypothesis - Random changes don’t teach you anything. Always articulate why you expect the change to work.
- Insufficient sample size - Small samples produce unreliable results. Calculate required sample before starting.
- Stopping tests early - “Peeking” and stopping when results look good dramatically inflates false positives.
- Testing trivial changes - Button color rarely matters. Test changes that could plausibly have meaningful impact.
- Ignoring segments - Overall results might hide important segment-level differences.
- No documentation - Without records, you’ll repeat mistakes and lose learnings when team members leave.
- Testing everything - Focus on high-traffic, high-impact areas. Testing a page with 100 monthly visitors is pointless.
- Copying competitors - What works for them might not work for you. Form your own hypotheses.
A/B Testing Checklist
Before Testing
- Identified high-impact opportunity
- Formed clear, documented hypothesis
- Calculated required sample size
- Estimated test duration
- Set up correct tracking
- QA’d all variants thoroughly
During Test
- Monitoring for technical issues only
- Not peeking at results to make decisions
- Not making changes to variants
- Running for full calculated duration
After Test
- Reached statistical significance
- Analyzed primary metric
- Checked key segments
- Documented learnings
- Implemented winner (or iterated)
- Shared results with team
Key Takeaways
A/B testing transforms decision-making from opinion-based to data-driven. The companies that test most learn fastest.
Remember:
- Sample size is everything. Small samples produce unreliable results.
- Hypotheses drive learning. Random tests don’t teach.
- Don’t peek. Stopping early invalidates your results.
- Document everything. Learning compounds when captured.
- Focus on high-impact areas. Not everything is worth testing.
Start with your highest-traffic, lowest-converting page. Form a hypothesis about why it underperforms. Design a variant. Calculate sample size. Run the test. Learn. Repeat.
The best time to start testing was a year ago. The second best time is now.