Why Most AI UGC Still Sounds Fake (and the Speech-to-Speech Fix That Makes Avatars Believable)

A skincare brand we were chatting with last quarter had just spent a fortnight building AI actor ads. Forty-odd variations, a handful of synthetic creators, scripts written by a model, the lot. On paper it was the dream: a month of creative volume produced in an afternoon, at a cost that was a rounding error next to a roster of real creators.

The ads tanked. Not "needs optimising" tanked. Tanked. CTR sat well under the brand's own benchmark and the thumb-stop rate was embarrassing.

The founder assumed the visuals were the problem. They weren't. The faces were genuinely good, good enough that I had to look twice. The problem was the half-second your brain takes to hear a voice and quietly decide "that's not a person". And once that decision lands, no amount of pretty footage wins it back.

That gap between believable face and unbelievable voice is the whole game with AI UGC right now. So here's a teardown of why it happens, and the specific fix that closes most of it.

The tell isn't the face anymore, it's the voice

The image side of this got solved faster than anyone expected. You can generate an avatar, drop them in a kitchen or a gym, put your product in their hand, and it reads as real. That part is basically a non-issue now.

Audio is where it falls apart. A flat, evenly-paced, perfectly-articulated read is the single biggest giveaway that a machine made the ad. People can't always tell you why it feels off, but they feel it, and they scroll. You've burned the impression before the hook even lands.

I break the "sounds fake" problem into four tells. Most failed AI UGC is guilty of at least three of them.

1. The voice is too clean. Real people stumble. They start a sentence twice. They say "um", they trail off, they get a touch too excited about a thirty-percent-off sale. Pristine text-to-speech strips all of that out and leaves you with something that sounds like a hold message.

2. The pacing never changes. Humans speed up when they're keen and slow down when they want you to pay attention. A robotic read holds one tempo the whole way through. Your ear clocks the metronome even if your conscious brain doesn't.

3. The opening line is written, not spoken. "I'm excited to tell you about something awesome." Nobody talks like that. We've all seen the scripts a model spits out by default, and the first line is almost always the most stilted part. If the hook sounds like ad copy being read aloud, you've lost.

4. It's trying to sell in the first three seconds. A line like "I know a good deal when I see one" feels fine on the page and lands as deeply salesy out of a synthetic mouth. The more an AI voice pushes, the faker it gets, because we associate that pushiness with an ad and the whole point was to not feel like one.

The fix: stop typing the script, start speaking it

Here's the part that actually changed my mind on AI UGC being usable at all.

Most people generate these ads with text-to-speech. You type the script, the tool reads it, and you get the metronome problem above. The better path is speech-to-speech: you record yourself saying the line in the tone you actually want (casual, a bit messy, real), and the tool maps your delivery onto the avatar's voice. The words come out in the actor's voice, but the rhythm, the emphasis and the little human imperfections are yours.

The difference is night and day. A text-to-speech read of "where did you get that?" is flat. A speech-to-speech version, where you've actually said it like a mate answering a question, carries the hesitation and the lift that makes it believable. You're not asking the machine to invent humanity. You're lending it yours.

This is also what makes volume real rather than theoretical. Record one good casual take, map it across a dozen avatars, and you've got a dozen believable variations to test instead of a dozen robots saying the same thing. The win was never "more ads". It was "more ads that don't get scrolled past".

Before and after: the settings that actually move it

If you're stuck with text-to-speech for a particular line, the tonality controls still matter, and most people leave them on the defaults. The defaults are tuned for clarity, and clarity is exactly what makes a read sound fake.

Here's roughly how I'd shift them. Treat these as directions, not gospel, because every voice behaves a little differently and you'll want to preview each one.

  • Speed. Nudge it up slightly, to around 1.1x to 1.2x. A default read drags. A real person promoting something they like talks a touch quicker than they think they do.
  • Stability. Pull it down. High stability is the metronome, every word the same weight. Lower it, into the 40-ish range, and you get variation in delivery, which is what your ear reads as human.
  • Similarity. Ease it off. Cranked to the top, it holds every line to the same tone and flattens everything. Loosening it lets the read breathe.
  • Style or emotion. Add a little, not a lot. A small amount, think 10-ish, puts some life in. Too much and you swing past human straight into pantomime.
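
If it helps to see those shifts written down, here's a minimal Python sketch of a "before" and "after" preset. It isn't any particular tool's API. The VoiceSettings class, the loosen helper, the 0 to 100 slider scale, the default values and the similarity figure of 70 are all assumptions for illustration; only the speed, stability and style targets come from the list above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container for the four controls discussed above."""
    speed: float     # playback multiplier, 1.0 assumed to be the default
    stability: int   # assumed 0-100 scale
    similarity: int  # assumed 0-100 scale
    style: int       # assumed 0-100 scale


# "Before": stand-in values for a clarity-tuned default (assumed, not from any tool).
DEFAULT = VoiceSettings(speed=1.0, stability=75, similarity=90, style=0)


def loosen(base: VoiceSettings) -> VoiceSettings:
    """Return a copy shifted in the directions described above."""
    return replace(
        base,
        speed=1.15,                           # middle of the 1.1x to 1.2x nudge
        stability=min(base.stability, 40),    # pull down to the 40-ish range
        similarity=min(base.similarity, 70),  # "ease it off"; 70 is an assumed figure
        style=max(base.style, 10),            # a little emotion, not a lot
    )


casual = loosen(DEFAULT)
print("before:", DEFAULT)
print("after: ", casual)
```

Treat the output as a starting point to preview, not a recipe. The useful habit is keeping the before and after side by side so you can hear what each change did.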

The before is the default: clean, stable, even, dead. The after is faster, looser, slightly emotional and, crucially, imperfect. The whole job is to engineer back in the messiness that text-to-speech engineered out. And honestly, the single highest-impact move isn't any one slider. It's fixing that first line so it sounds spoken. Cut "I'm excited to tell you about something awesome", open with "yo, did you know..." instead, and you've done more than any setting will.

The disclosure line we hold for client brands

There's a part of this nobody running these ads at volume can pretend isn't coming, so let's name it.

When the voice and the face are both synthetic and the ad is built to feel like a real person's honest recommendation, you're in territory the FTC has flagged for a while. Their rules on endorsements and testimonials are clear enough: an endorsement has to reflect the endorser's honest opinion or experience, and you can't present a fabricated person as a real customer vouching for a product. A synthetic actor reading "every single time I buy this, it's the one I reach for" is, on its face, exactly that.

My take, and the line we hold for client brands, is simple. We use AI actors as presenters, not as fake testimonials. The avatar can demonstrate the product, explain a sale, walk through a feature. What it doesn't do is invent a personal history it never had and pass it off as a real review. If an ad leans on "I've used this for years", that claim has to be true of a real person, or it doesn't run.

We also keep the synthetic-media disclosure conversation live rather than buried. The technology to make a fully believable fake person is here and getting better every month, and the regulation is going to catch up. Brands that build the habit now (presenter not impostor, clear about what's generated) aren't going to get caught flat-footed when it does. The ones shipping AI actors as fake five-star customers are writing a problem for their future selves.

Where to from here

The frustrating thing about AI UGC is that it fails quietly. The ad looks fine, so the assumption is the offer or the audience is off, and the real culprit, that half-second where the voice gives it away, never gets diagnosed.

If you've got AI actor ads underperforming and you can't put your finger on why, that's exactly the kind of thing a Signal/Noise Audit is built to surface. We'd watch your creative the way a cold viewer does, find the tells that are quietly costing you the click, and tell you straight whether the fix is the voice, the hook, or the whole approach. No obligation, just a clear read on where the cringe is hiding.

Ethan To
CEO @ Pigeon Digital