How to Run an Incrementality Holdout Test Without Paying for Haus (A Step-by-Step for 6-7 Figure Brands)

I'll confess something that probably shouldn't come from someone who runs ads for a living: most of the numbers in your ad accounts are lying to you, and I've made budget calls on those lies more times than I'd like to admit.

Not lying maliciously. Lying the way a tape measure lies when everyone on the build is using a slightly different one. Meta says a sale was theirs. Google claims the same sale. Your post-purchase survey says the customer heard about you from a podcast. They can't all be right, and the honest answer is none of them are, exactly.

The big operators solved this by spending real money on it. They run formal lift platforms, media mix models, the lot. If you're doing eight figures and you've got a director of data, go and do that. But if you're a 6-7 figure brand, you don't need a five-figure-a-year measurement stack to get most of the way there. You can approximate it with a spreadsheet, a calendar, and the discipline to leave things alone for two weeks.

Here's the thing about incrementality. It's just one question asked plainly: if you turned this spend off, how much of the revenue would still show up anyway? Everything that would've happened regardless isn't incremental. You're paying for it, but you're not causing it. And until you've actually tested that, you genuinely don't know which of your channels are engines and which are just taking credit at the finish line.

So let me walk you through how to run your first holdout test yourself. No platform required.

Why coupon codes and platform ROAS both mislead you

Start with the two numbers most brands trust by default, because both are weaker than they look.

The first is coupon-code or "link-attributable" tracking, the kind you lean on for direct mail, influencers, anything offline. You put a code on the postcard - SAVE10 - and you count the redemptions. Feels clean. It isn't. Most people never use the code. They get the card, it sits on the counter for a week, and when they finally buy they grab a different code off the site or off Honey, or they just forget the card existed and buy anyway.

I've seen the studies the direct mail platforms have run on this, and the gap is bigger than you'd guess. Coupon redemptions typically understate the real impact of a campaign by a factor of three to seven. So a mailer that "drove" ~$50k through codes might actually be driving closer to ~$150-350k once you measure it properly. If you judged that channel on the redemption number and killed it, you'd be cutting one of your best performers and never knowing it.

The second number is in-platform ROAS, the one Meta reports back to you. The problem there runs the other way. The platform claims credit for people who were already going to buy. This is the bit that genuinely changed how I think about Meta: when brands tighten their bid strategy or get aggressive with ROAS targets, the system quietly drifts toward people who already know them, because that's the cheapest way to hit the goal you set. You ask for a customer at $20, and it hands you someone who already bought from you, because finding a genuinely new buyer would've cost $80. Same reported ROAS. Completely different reality.

One brand I worked with had a returning-customer exclusion running on all their prospecting and still found, when they checked server-side, that the majority of "acquired" purchasers in a session were existing customers. The platform was charging them to re-acquire customers they already had. That's not a bug. That's the machine doing exactly what it's built to do.

So one number understates, the other overstates. A holdout test is how you stop guessing which.

The three scrappy holdouts we actually deploy

You don't need one perfect method. You need the cheapest one that fits the channel you're worried about. Here are the three I reach for, easiest first.

1. The structured pause (the MER delta test). This is the bluntest version and the one I'd start with. You pick a channel, turn it off completely for a defined window - say two weeks - and you watch your MER (marketing efficiency ratio): total revenue divided by total spend across everything. Not the channel's own ROAS. The whole business.

The logic is simple. If you cut ~$15k of Meta spend and total revenue barely moves, that spend wasn't very incremental - the demand found another door. If you cut it and the whole business drops hard, it was doing real work. The trick is picking a clean window. No major sale, no product launch, no email blast distorting the read. Compare the pause fortnight against a normal trading fortnight just before it, and look at the delta in MER. If revenue holds while spend falls, MER jumps, and the channel wasn't earning its keep. If revenue falls at least in step with the spend, MER stays flat or drops, and the channel was pulling its weight. It's crude, but it's honest, and it costs you nothing but nerve.
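If it helps to see the arithmetic, here's a minimal sketch of that comparison in Python. The function names and the figures are mine and purely illustrative, and the "revenue lost per dollar cut" line is my own rough stand-in for incremental ROAS, not a standard platform metric.

```python
def mer(total_revenue: float, total_spend: float) -> float:
    """Total revenue divided by total spend across every channel."""
    if total_spend <= 0:
        raise ValueError("total_spend must be positive")
    return total_revenue / total_spend


def pause_readout(baseline_revenue, baseline_spend, pause_revenue, pause_spend):
    """Compare a pause fortnight against the normal fortnight before it."""
    spend_cut = baseline_spend - pause_spend
    if spend_cut <= 0:
        raise ValueError("pause_spend should be lower than baseline_spend")
    revenue_lost = baseline_revenue - pause_revenue
    baseline_mer = mer(baseline_revenue, baseline_spend)
    pause_mer = mer(pause_revenue, pause_spend)
    return {
        "baseline_mer": baseline_mer,
        "pause_mer": pause_mer,
        "mer_delta": pause_mer - baseline_mer,
        # Assumption: revenue lost per dollar cut, used as a rough incremental ROAS.
        "revenue_lost_per_dollar_cut": revenue_lost / spend_cut,
    }


# Made-up numbers: $15k of spend cut, total revenue slips by $10k.
readout = pause_readout(
    baseline_revenue=120_000, baseline_spend=40_000,
    pause_revenue=110_000, pause_spend=25_000,
)
for name, value in readout.items():
    print(f"{name}: {value:.2f}")
```

In this made-up case MER climbs from 3.0 to 4.4 and each dollar cut cost roughly 67 cents of revenue, which is the pattern you'd expect from spend that wasn't very incremental.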

2. The geo split. A step up in rigour. Instead of turning a channel off everywhere, you turn it off in some regions and leave it running in others, then compare. Hold out a chunk of postcodes or states, keep the rest live, and measure revenue per region across the test.

The reason this is better than the pause is that your control regions absorb the seasonality for you. If the whole market dips that fortnight, it dips in your live regions too, so the gap between held-out and live is a much cleaner read on what the channel actually caused. This is the closest the scrappy version gets to what the platforms charge for. You're running a proper experiment - same brand, same period, the only difference being whether the channel was on.
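One simple way to turn that into a number is sketched below. It's an assumption on my part rather than what any lift platform does: it scales the held-out regions' pre-test revenue by whatever change the live regions saw, treats that as what the held-out regions would have done with the channel on, and reads the shortfall as the channel's contribution. The function name and figures are hypothetical.

```python
def geo_split_readout(holdout_pre, holdout_test, live_pre, live_test):
    """Estimate what the channel caused in the held-out regions.

    Each argument is total revenue for a group of regions over periods
    of equal length: *_pre is the fortnight before the test, *_test is
    the test fortnight itself.
    """
    if live_pre <= 0 or holdout_pre <= 0:
        raise ValueError("pre-period revenue must be positive")
    # Market movement (seasonality, general dips) seen where the channel stayed on.
    live_change = live_test / live_pre
    # Assumption: held-out regions would have moved the same way with the channel on.
    expected_holdout = holdout_pre * live_change
    incremental_revenue = expected_holdout - holdout_test
    return {
        "live_change": live_change,
        "expected_holdout_revenue": expected_holdout,
        "incremental_revenue": incremental_revenue,
        "incremental_share": incremental_revenue / expected_holdout,
    }


# Made-up numbers: the whole market dips 10%, held-out regions dip 19%.
geo = geo_split_readout(
    holdout_pre=30_000, holdout_test=24_300,
    live_pre=90_000, live_test=81_000,
)
for name, value in geo.items():
    print(f"{name}: {value:,.2f}")
```

Here the live regions fell 10%, so the held-out regions "should" have done $27k. They did $24.3k, so the channel looks responsible for about $2.7k, or 10% of expected revenue in those regions. This only holds up if the two groups of regions normally trade in a similar way.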

3. The segment holdout. This one's for anything built off your own customer list - email, SMS, direct mail, retargeting. You take an audience, randomly hold back a slice, and send to the rest. Ten thousand people in the segment, send to eight thousand, hold out two thousand, then compare revenue per person between the two groups.

That's the whole test. Revenue per held-out person versus revenue per messaged person. If the messaged group earns you meaningfully more per head, the spend is incremental. If the two groups look the same, you're paying to reach people who were going to buy regardless. It's the gold standard for list-based channels precisely because it's so hard to argue with - same audience, same week, split at random.
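As a rough sketch, the split and the comparison look something like this. The function names, the fixed seed, the send cost and all the figures are my own illustrative assumptions; in practice the split usually happens in your email or mail platform and the revenue comes from your order data.

```python
import random


def split_holdout(customer_ids, holdout_fraction=0.2, seed=42):
    """Randomly split a segment into (messaged, holdout) lists."""
    ids = list(customer_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    holdout_size = round(len(ids) * holdout_fraction)
    return ids[holdout_size:], ids[:holdout_size]


def segment_readout(messaged_revenue, messaged_people,
                    holdout_revenue, holdout_people, send_cost):
    """Compare revenue per person between the two groups."""
    rpp_messaged = messaged_revenue / messaged_people
    rpp_holdout = holdout_revenue / holdout_people
    lift_per_person = rpp_messaged - rpp_holdout
    # Extra revenue the send appears to have caused across everyone messaged.
    incremental_revenue = lift_per_person * messaged_people
    return {
        "rpp_messaged": rpp_messaged,
        "rpp_holdout": rpp_holdout,
        "lift_per_person": lift_per_person,
        "incremental_revenue": incremental_revenue,
        "incremental_roas": incremental_revenue / send_cost,
    }


# Ten thousand in the segment: send to eight thousand, hold out two thousand.
messaged, holdout = split_holdout(range(10_000))
print(len(messaged), len(holdout))

# Made-up revenue figures for the test window.
seg = segment_readout(
    messaged_revenue=48_000, messaged_people=len(messaged),
    holdout_revenue=10_000, holdout_people=len(holdout),
    send_cost=4_000,
)
for name, value in seg.items():
    print(f"{name}: {value:,.2f}")
```

With these invented numbers the messaged group earns $6.00 a head against $5.00 for the holdout, so the send looks responsible for about $8k of revenue on $4k of cost.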

Calibrating channel by channel (where the surprises live)

Run these and you'll find the read varies wildly by channel, often opposite to your gut. A few patterns worth setting expectations on.

Direct mail almost always reads better than its codes suggest. Remember that 3-7x gap. When you measure direct mail with a proper holdout instead of redemptions, the incremental ROAS routinely lands well above what the codes implied. That coffee-table effect is real - a good catalogue sits in the house and pulls traffic for a week, long after a redemption window would've closed the book on it. If you've been judging mail on codes alone, you've almost certainly been underrating it.

Retargeting usually reads worse. This is the one that stings. Retargeting reports beautifully in-platform because warm people convert cheaply, but a holdout often shows a big chunk of those people were coming back regardless. The platform took credit for a sale that was already on its way. I'm not saying kill retargeting - it catches the genuinely on-the-fence ones - but I'd be very sceptical of an 8x retargeting ROAS until a holdout has confirmed it. Between the two, coupon-code thinking understates mail and platform reporting overstates retargeting. On the mail side, the spread between belief and reality can easily be that 3-7x; on the retargeting side, only a holdout will tell you how big it is.

Engaged audiences can be more incremental than cold ones, which feels backwards. Here's a genuinely counterintuitive one from list testing. You'd assume the most incremental people to mail are the lapsed, unengaged ones - they're not hearing from you elsewhere, so surely the mail does the work. Tested properly, brands keep finding the opposite: their engaged segments produce higher incremental returns. The read seems to be that when someone's already near a buying decision, an extra touch tips them over. Worth testing before you build your whole segmentation on a hunch.

The lesson across all of it: don't assume. Run the held-out version and let the channel tell you.

How big it needs to be to trust it

A holdout that's too small tells you nothing, and a false read is worse than no read. So before you act on anything, sanity-check the size.

For a segment holdout, you want enough people and enough orders on both sides that the difference isn't just noise. A held-out group of a few hundred buyers is thin - one big order swings the whole thing. I'd want a few thousand people in the segment before I'd lean on the result, more if your prices are high and your order count is low.
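If you want something firmer than eyeballing it, one option is a shuffle test: pool everyone's revenue, randomly relabel who was "messaged" and who was "held out" a thousand times, and count how often chance alone produces a gap as big as the one you saw. This is a technique I'm bringing in as a sanity check, not the only way to do it, and the function name, the data and the defaults below are all illustrative assumptions.

```python
import random


def shuffle_test(messaged, holdout, n_shuffles=1000, seed=0):
    """Share of random relabelings with a revenue-per-person gap at least
    as large as the observed one. A high share means the gap could
    easily be noise.

    messaged / holdout: one revenue figure per person (0 for non-buyers).
    """
    rng = random.Random(seed)
    n_messaged = len(messaged)
    n_holdout = len(holdout)
    observed = sum(messaged) / n_messaged - sum(holdout) / n_holdout
    pooled = list(messaged) + list(holdout)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        gap = (sum(pooled[:n_messaged]) / n_messaged
               - sum(pooled[n_messaged:]) / n_holdout)
        if abs(gap) >= abs(observed):
            hits += 1
    return hits / n_shuffles


# Made-up thin test: 800 messaged with 40 buyers, 200 held out with 8 buyers,
# every order worth $50.
messaged_revenue = [50] * 40 + [0] * 760
holdout_revenue = [50] * 8 + [0] * 192
noise_share = shuffle_test(messaged_revenue, holdout_revenue)
print(f"share of shuffles matching the observed gap: {noise_share:.2f}")
```

As a loose rule of thumb (mine, not a law), if more than a small fraction of shuffles match your observed gap, the holdout is too thin to act on and you need more people or a longer window.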

For a geo split or a structured pause, the equivalent question is spend and time. If a channel is only ~3-5% of your total budget, a two-week pause might not move MER enough to read above the daily wobble of the business - the signal's smaller than the noise. The channels worth testing first are the big ones, the ones eating 30-50% of spend, because that's where a wrong assumption is costing you the most and where the read will actually show up. Test your biggest spend before you go chasing the incrementality of some 2% channel.

And give it long enough. A few days isn't a test, it's weather. Two weeks is roughly the floor for most brands, longer if your consideration cycle is long.
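To make the spend-versus-noise point concrete, here's a rough pre-check you could run before a pause. It compares the revenue drop you'd expect from switching a channel off against the normal fortnight-to-fortnight wobble in your revenue. Everything in it is an assumption of mine: the function name, the guessed incremental ROAS you feed in, and the rule that the expected drop should be at least twice the wobble.

```python
from statistics import stdev


def pause_is_readable(fortnight_revenue, channel_spend,
                      assumed_iroas=1.0, noise_multiple=2.0):
    """Return (readable, expected_drop, wobble) for a planned pause.

    fortnight_revenue: total revenue for recent consecutive fortnights.
    channel_spend: what the channel would normally spend in the window.
    assumed_iroas: your guess at incremental revenue per dollar of spend.
    """
    changes = [later - earlier
               for earlier, later in zip(fortnight_revenue, fortnight_revenue[1:])]
    if len(changes) < 2:
        raise ValueError("need at least three fortnights of history")
    wobble = stdev(changes)  # typical swing between fortnights
    expected_drop = channel_spend * assumed_iroas
    # Assumption: the signal should clear the noise by a comfortable margin.
    return expected_drop >= noise_multiple * wobble, expected_drop, wobble


# Made-up revenue history for five consecutive fortnights.
history = [100_000, 104_000, 98_000, 102_000, 99_000]

small = pause_is_readable(history, channel_spend=2_000, assumed_iroas=2.0)
big = pause_is_readable(history, channel_spend=15_000, assumed_iroas=2.0)
print("small channel readable:", small[0])
print("big channel readable:", big[0])
```

With this invented history the business wobbles by about $5k between fortnights, so pausing a $2k channel would vanish into the noise while pausing a $15k channel should show up clearly.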

What to do once you've got the read

Say you've run it and the number's in. Now the discipline.

If a channel comes back strongly incremental, that's permission to push - you've confirmed the spend is causing revenue, not just reporting it. If it comes back weak, you've got a decision: cut it, or move that budget somewhere a test has shown actually moves the business. This is exactly how the sharper operators reallocate - they spend less on the channels that turned out non-incremental and pour it into the ones that held up, even when the in-platform ROAS would've told them to do the opposite.

The mindset shift I'd want you to take from all this: stop chasing a single true number. You won't find it - perfect attribution doesn't exist, and the brands that wait for it act too late. What you're after is a shared, good-enough read that everyone on the team agrees to make decisions against. A holdout gives you that. It's the calibrated measuring stick the whole build can work off.

You can absolutely run the first one yourself this month. Pick your biggest-spend channel, pick a clean fortnight, and run the structured pause. It'll be rough, it'll be honest, and it'll teach you more about your account than another dashboard ever will.

If you'd rather not guess at the setup - how big the holdout needs to be, which channel to test first, what counts as a trustworthy read for your spend level - that's the kind of thing we map out in a Signal/Noise Audit before a brand bets budget on it. We'll look at your accounts and your unit economics and tell you exactly where a holdout would pay for itself fastest, and what the minimum it needs to be believable looks like.

So here's the question I'd sit with: of all your channels, which one are you most afraid to turn off for two weeks? That fear is usually pointing straight at the test you most need to run.

Ethan To
CEO @ Pigeon Digital