6 min read
Discovery Chat

Danylo Borodchuk
Co-Founder, COO

Most Series A companies think they're hypothesis-driven. They're really just running confirmation bias at scale.
You ran an experiment. It was statistically significant. You shipped it. You felt scientific.
You were running a confirmation machine.
Here's the trap: the difference between a company that experiments and a company with an experimentation culture isn't whether you run tests. It's whether you're actually learning from them. Most Series A companies skip the learning part. They see a p-value, interpret it as permission, and move forward.
That's not experimentation. That's just randomized action.
The Three Stages Nobody Talks About
Yuzing, who's built data and analytics at multiple high-growth startups, described three stages of experimentation maturity:
Stage 1: The Gut Instinct Years
No experiments at all. You're shipping based on founder intuition, customer conversations, and what you think is true. This is obviously limited. But it's honest—you know you're not being data-driven.
Stage 2: The Confirmation Habit
Now you run experiments. You form a hypothesis (sometimes), design a test, collect data, and ship whatever shows statistical significance. You've elevated your process. You have rigor.
Except you don't. What you've actually done is systematize your confirmation bias. You're not learning faster—you're just making bad decisions feel scientific.
Stage 3: Hypothesis-Driven Learning
You have a hypothesis. You have success metrics. You understand what would have to be true for the experiment to matter. You run the test. You look at the results across segments and cohorts—not just the topline. You ask: who benefited? Who didn't? Does this change our thinking about how the product works?
Most Series A companies believe they're at Stage 3. They're at Stage 2 and don't know it.
Why Stage 2 Feels So Convincing
Running a test feels rigorous. You're using statistics. You're not just guessing. And the results—when they show a winner—feel like proof.
But statistical significance is not the same as meaningful learning. Here's what actually happens at Stage 2:
You form a vague hypothesis. "Increase signups." "Improve retention." Not: "Users abandon at the email confirmation step because the CTA button color doesn't contrast with our background. Changing it to high-contrast will reduce drop-off by 3%." The vague version sounds faster. It's also useless, because even if you get a winner, you don't understand why.
You see a p-value and call it a decision. The experiment ran for two weeks. You hit 0.05 significance. You ship. What you don't ask: did you power the test to detect the effect size you care about? Did you calculate your sample size up front, or are you just peeking at results whenever they look good? (Airbnb found this was rampant—tests would hit significance, then converge back to nothing because the company kept peeking.)
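To make the peeking problem concrete, here's a minimal simulation sketch (not from the article; all rates and traffic numbers are illustrative assumptions). Two identical variants, checked twenty times as data arrives versus once at a pre-planned sample size:

```python
# Hypothetical simulation of peeking on an A/A test (no real difference).
# Every number here is an illustrative assumption, not data from the article.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_arm, n_peeks = 2_000, 10_000, 20
base_rate = 0.05                      # both arms convert at the same rate

checkpoints = np.linspace(n_per_arm // n_peeks, n_per_arm, n_peeks, dtype=int)
wins_peeking = wins_fixed = 0

def p_value(a, b):
    """Two-proportion z-test p-value for arrays of 0/1 conversions."""
    n = len(a)
    pooled = (a.sum() + b.sum()) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = (b.mean() - a.mean()) / se
    return 2 * stats.norm.sf(abs(z))

for _ in range(n_sims):
    a = (rng.random(n_per_arm) < base_rate).astype(int)
    b = (rng.random(n_per_arm) < base_rate).astype(int)
    # Peeking: call the test the moment any interim look crosses 0.05.
    wins_peeking += any(p_value(a[:n], b[:n]) < 0.05 for n in checkpoints)
    # Pre-planned: one look at the full, pre-committed sample size.
    wins_fixed += p_value(a, b) < 0.05

print(f"'significant' with 20 peeks: {wins_peeking / n_sims:.0%}")
print(f"'significant' at fixed n:    {wins_fixed / n_sims:.0%}")
```

With a pre-registered sample size the false positive rate stays near the promised 5%; with repeated peeks it climbs to several times that, which is exactly the "hit significance, then converge back to nothing" pattern described above.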
You miss that the win is asymmetric. The topline shows a 4% uplift. But you didn't cut the data by segment. What you actually found: power users see 12% uplift. New users see -2% decline. But the overall number was positive, so you shipped. Now you've solved a problem for power users and created a new one for new users. You've learned nothing.
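Here's a sketch of the segment cut that would have caught this, assuming a per-user experiment export. The file name, column names, and segment labels are hypothetical, not a real schema:

```python
# Hypothetical per-user export with columns:
# user_id, variant ("control"/"treatment"), segment, converted (0/1).
import pandas as pd

df = pd.read_csv("experiment_results.csv")   # assumed export, not a real file

# Topline lift: the single number a Stage 2 team stops at.
topline = df.groupby("variant")["converted"].mean()
print("topline lift:", topline["treatment"] - topline["control"])

# The same lift cut by segment: this is where a +12% / -2% split
# hides behind an overall +4%.
by_segment = df.groupby(["segment", "variant"])["converted"].mean().unstack()
by_segment["lift"] = by_segment["treatment"] - by_segment["control"]
print(by_segment.sort_values("lift"))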
You confuse action with learning. "We ran a test. We got a winner. We shipped." That's action. Learning is: "We ran a test. We got a winner. We learned that this user segment values X, which means our onboarding assumption was wrong. Here's what we're testing next."
Stage 2 looks like iteration. It's just thrashing.
What Breaks at Scale
The Stage 2 problem becomes obvious once you have any traffic. You can run tests fast—throw five simultaneous experiments at the wall, ship the two that hit significance. This creates the illusion of rapid learning. In reality, you're running five confirmations of pre-existing biases, and some of them are going to contradict the others.
You end up with a product that's a patchwork of significance artifacts. Your analytics dashboard shows 47 metrics, all individually optimized, none pointing toward a coherent strategy. Your founder is frustrated: "We're testing constantly but nothing seems to compound."
The reason: you stopped asking "what does this teach us about our customer" and started asking "which number went up?"
How to Spot if You're Stuck at Stage 2
You run a lot of tests but nothing changes. Shipping is constant. Learning is absent. The tests are tactical tweaks—button colors, copy variants, signup flow micro-optimizations. None of them connect to each other or to a larger hypothesis about product-market fit.
Your experiment results rarely surprise you. If every test confirms what you already believed, you're not actually testing assumptions—you're just validating hunches with statistics.
You don't segment your results. When an experiment "wins," you look at the topline. You don't break it down by user cohort, geography, feature usage, or customer segment. You have no idea if the win is broad-based or driven by a small group.
Your winning experiments don't predict future behavior. You shipped the test, celebrated the win, and then... six months later, the metric drifted back. No sustained learning. No second-order effects documented. Just: we thought this would work, the test said yes, and then it didn't matter.
What Stage 3 Actually Looks Like
Hypothesis-driven experimentation is slow on the front end and fast on the back end.
You spend time formulating a real hypothesis. Not "increase conversion" but "mobile users are converting at half the rate of desktop users because the checkout flow has unnecessary steps. Reducing the steps by two will increase mobile conversion to parity with desktop." That takes thought. It means understanding your funnel, talking to users, and making a falsifiable prediction.
You design the test to actually answer the question, not just see if the metric moves. You calculate sample size up front. You decide the minimum effect size you care about. You set a stopping rule. You're not peeking.
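For what "calculate sample size up front" looks like in practice, here's a rough power calculation using statsmodels; the baseline rate, minimum detectable effect, and traffic figure are illustrative assumptions, not prescriptions:

```python
# Pre-registering the test: decide sample size and duration before launch.
# All rates and traffic numbers below are illustrative assumptions.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.020            # current mobile checkout conversion rate
mde = 0.002                 # smallest absolute lift worth acting on
alpha, power = 0.05, 0.80   # significance level and desired power

effect = proportion_effectsize(baseline + mde, baseline)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=alpha, power=power, alternative="two-sided"
)
print(f"need ~{n_per_arm:,.0f} users per arm")

# Turn that into a stopping rule: run for a fixed window, then read out once.
daily_eligible_users = 4_000     # hypothetical traffic entering the test
days = 2 * n_per_arm / daily_eligible_users
print(f"run for ~{days:.0f} days, no peeking before then")
```

The specific library doesn't matter. What matters is that the effect size you care about and the stopping rule are written down before any data arrives.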
You analyze by segment. When you get results, the first thing you do is ask: "Who does this affect? What does it tell us about how people actually use this product?" A 5% topline lift that's driven entirely by users who already pay is a different insight than a 5% lift that's broad-based.
You connect the result to the next experiment. You're building a chain of learning, not a collection of wins.
Most importantly: you treat negative results as data, not failures. "This experiment failed" is low-resolution. "This experiment showed that our assumption about X was wrong, which means we need to rethink Y" is learning.
Why This Matters to Your Growth
Every company at Stage 2 believes they're running experiments. They're really running a high-powered version of gut-feel decision making, but with a p-value attached.
The cost is opportunity: you're not learning what your customers actually want. You're learning what your current assumptions suggest they want. And you're reinforcing those assumptions with every test.
Stage 3 costs more time up front—you have to think before you test. But the learning compounds. An experiment that teaches you something about user behavior in one part of your product informs decisions in other parts. You build a coherent product, not a random collection of optimized elements.
Making the Shift
Start by treating one experiment as a real learning exercise. Not five parallel tests for speed. One test, with a clear hypothesis, designed to teach you something that will change your next decision.
When you get results, don't ask "did it win?" Ask "what was I wrong about, and what do I do next?" Segment your analysis. Look for where the effect is strong and where it's weak. Those edges are where the learning lives.
The infrastructure matters here. If running an experiment requires three weeks and an engineer, you'll keep running Stage 2 (confirmation) because you can't afford to run Stage 3 (learning). You need a setup where you can test an idea in days, analyze the results by segment in minutes, and decide on the next test in an afternoon.
At Lopus, we see teams getting stuck in this exact trap. They have dashboards, they run tests, but they're not segmenting results. A test shows a win, but they don't know if it's driven by mobile or desktop, new users or retained, free or paid. Discovery Chat lets you ask "show me this test result by customer segment" in plain English — text to SQL, no analyst queue. Suddenly the win becomes a learning: where is the effect real? That question changes everything.
The shift from Stage 2 to Stage 3 isn't about running more tests. It's about asking better questions.
What does your current experiment culture look like? Are you seeing results that actually stick, or do they fade after you ship?


