
LLM-written ad copy: a 6-month performance comparison

We ran human-written ads against AI-written ads on the same campaigns for 6 months. The results weren't what either side of the debate predicted.

AdControlCenter Team

The discourse on AI-written ad copy has settled into two camps. One side says LLMs produce generic slop that always underperforms humans. The other side says LLMs are already better than 80% of working copywriters. Both camps tend to make their case from anecdote.

We've been running this comparison for 6 months across our own campaigns and a handful of customer accounts. Here's what the numbers actually show.

The setup

Same campaigns, same budgets, same audiences. Half the ads in each ad-group were written by a human (us), half by Claude Sonnet using a prompt template seeded with our voice guide. We measured CTR and conversion rate weekly. Total: 12 ad-groups across 4 accounts, ~300 ads, ~4M impressions over 26 weeks.
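For the curious, here's a minimal sketch of what that generation step can look like with the Anthropic Python SDK. The voice-guide excerpt, prompt wording, and model ID are illustrative stand-ins, not our production template:

```python
# Sketch: generate ad variants from a voice guide + campaign brief.
# The voice guide, prompt wording, and model ID are illustrative.
import anthropic

VOICE_GUIDE = """\
Tone: plain-spoken, numbers-first, mildly contrarian.
Never use: "revolutionize", "unlock", "game-changing".
Always: one concrete claim per headline.
"""

def generate_variants(brief: str, n: int = 30) -> list[str]:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID
        max_tokens=2000,
        system=f"You write ad copy. Follow this voice guide strictly:\n{VOICE_GUIDE}",
        messages=[{
            "role": "user",
            "content": (
                f"Campaign brief:\n{brief}\n\n"
                f"Write {n} distinct ad variants, one per line, "
                "headline and description separated by ' | '."
            ),
        }],
    )
    # One variant per non-empty line of the model's response.
    return [line for line in message.content[0].text.splitlines() if line.strip()]
```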

What the numbers showed

Three top-line findings:

1. Average CTR was nearly identical. Human-written ads averaged 3.2% CTR. LLM-written ads averaged 3.1%. Statistically indistinguishable across the 4M impressions.

2. Variance was very different. Human ads had a tighter distribution — fewer outliers in either direction. LLM ads had more high performers and more low performers. The variance ratio was roughly 1.8x.

3. Conversion rate diverged. Once a click landed on the page, conversion rates were 7% lower on average for LLM-written ads. The gap was small but consistent across categories.
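A note for anyone reproducing this: don't run significance tests at the impression level. Clicks cluster by ad and ad-group, so a naive two-proportion test over 4M impressions will overstate certainty. Comparing per-ad CTRs is a more honest quick check. A minimal sketch, with placeholder numbers in place of our per-ad data:

```python
# Sketch: compare per-ad CTR distributions between the two arms.
# The arrays here are placeholders, not our actual per-ad numbers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
human_ctr = rng.normal(0.032, 0.004, size=150)  # placeholder per-ad CTRs
llm_ctr = rng.normal(0.031, 0.0054, size=150)   # wider spread, per finding 2

# Welch's t-test: is the mean CTR gap distinguishable from noise?
t, p = stats.ttest_ind(human_ctr, llm_ctr, equal_var=False)
print(f"mean human={human_ctr.mean():.4f} llm={llm_ctr.mean():.4f} p={p:.3f}")

# Variance ratio: how much wider is the LLM distribution?
print(f"variance ratio (llm/human): {llm_ctr.var() / human_ctr.var():.2f}")
```

The placeholder spreads are chosen to echo findings 1 and 2: a mean gap too small to separate from noise at this sample size, and a variance ratio near 1.8x.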


Where LLMs won

Three categories where AI consistently beat human writing:

  • Volume tests. When we needed 20+ variants for a single ad-group, LLM-written ads dominated. The top 3 by CTR in each volume test were almost always LLM. Humans don't naturally generate 20 distinct variants of the same ad.
  • Headline experimentation. LLMs were better at generating unexpected angles. Humans tend to fall back on the same 3 framings ("save time," "increase revenue," "join thousands"); LLMs would offer 30 genuinely different framings for the same ad-group.
  • Pure feature description. Where the ad was straightforward — "X does Y" — LLMs matched humans on output quality and beat them on speed.

Where humans won

Three categories where human writing consistently outperformed:

  • Brand-voice nuance. Subtle in-jokes, cultural references, or contrarian claims that depended on context — LLMs flattened these into generic versions. The human ads kept the personality; the LLM ads sanded it off.
  • One-of-a-kind angles. When the winning ad in an ad-group depended on a specific insight or story, humans owned it. LLMs would produce competent but interchangeable variants.
  • Conversion-critical CTAs. Subtle CTA wording that tested better on landing-page conversion (not just click) was almost always human-written. We don't fully understand why — possibly LLMs over-optimize for click and under-think the post-click moment.

The workflow we settled on

After 6 months of A/B testing, our actual production workflow:

  1. LLM generates 30–50 variants for any new ad-group, fed our voice guide and the campaign brief.
  2. Human reviewer keeps the top 8 based on brand fit and a gut feel for what will convert.
  3. Human writes 2–4 additional variants specifically to capture the "one-of-a-kind angle" that LLMs miss.
  4. Combined set ships. Usually 10–12 ads per ad-group.
  5. After 30 days, the bottom half is paused. The remaining 5–6 ads are usually a mix of LLM and human, with no consistent dominance from either.
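Step 5 is mechanical enough to script. A minimal sketch; the Ad record and pause_ad() helper here are hypothetical stand-ins for whatever your ad platform's API exposes:

```python
# Sketch: after 30 days, pause the bottom half of an ad-group by CTR.
# `Ad` and `pause_ad` are hypothetical stand-ins for a real platform API.
from dataclasses import dataclass

@dataclass
class Ad:
    ad_id: str
    impressions: int
    clicks: int

    @property
    def ctr(self) -> float:
        return self.clicks / self.impressions if self.impressions else 0.0

def pause_ad(ad: Ad) -> None:
    # Stand-in for the real "pause this ad" API call.
    print(f"pausing {ad.ad_id} (CTR {ad.ctr:.2%})")

def prune_bottom_half(ads: list[Ad], min_impressions: int = 1000) -> list[Ad]:
    # Only rank ads with enough impressions for CTR to mean anything.
    ranked = sorted(
        (a for a in ads if a.impressions >= min_impressions),
        key=lambda a: a.ctr,
        reverse=True,
    )
    cutoff = (len(ranked) + 1) // 2  # keep the top half, rounding up
    for ad in ranked[cutoff:]:
        pause_ad(ad)
    return ranked[:cutoff]
```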

The headline result: we ship more ads, faster, with about the same end-state performance as pure-human writing. The win isn't quality; it's speed and volume.


What the data doesn't tell us

A couple of honest caveats:

Sample size. 4M impressions sounds large, but it's spread across 4 accounts in different categories. Within any single account the sample is tighter. The conclusions might shift with a 10-account, 12-month version of the test.

Voice guide quality matters more than model. The LLM-written ads that performed well were the ones from accounts with rich voice guides. Accounts with thin voice guides produced LLM ads that read like every other LLM ad on the internet — and underperformed accordingly. The model is doing what you ask. If you ask vaguely, it produces a vague output.

Recency effects. Ad-platform algorithms tend to favor fresh ads during their learning periods. LLMs let us refresh ad-groups more often, which might be giving them an unfair tailwind we haven't fully isolated.

What we'd test next

If we ran this again with more time:

  • Test the same protocol with a smaller, voice-trained model (a fine-tuned Sonnet vs. base Sonnet on the same accounts). Hypothesis: fine-tuning closes the brand-voice gap and the human edge shrinks.
  • Test multi-LLM ensembles: generate variants from Claude, GPT, and Gemini, then have a human pick (see the sketch after this list). Hypothesis: ensembles hit the top of the distribution more often than any single model.
  • Test the post-click conversion gap directly — is it really the ad copy, or is the LLM-written ad attracting a slightly different visitor profile?
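The ensemble harness itself needs very little code. One shape it might take, with the three generator functions as stubs where real SDK calls would go:

```python
# Sketch: pool variants from several models, keeping a source tag so the
# human reviewer (and later analysis) can see which model produced what.
# The three generators are stubs; wire in real SDK calls in their place.
from typing import Callable

Generator = Callable[[str, int], list[str]]  # (brief, n) -> n ad variants

def claude_variants(brief: str, n: int) -> list[str]:
    return [f"claude variant {i}: {brief}" for i in range(n)]  # stub

def gpt_variants(brief: str, n: int) -> list[str]:
    return [f"gpt variant {i}: {brief}" for i in range(n)]  # stub

def gemini_variants(brief: str, n: int) -> list[str]:
    return [f"gemini variant {i}: {brief}" for i in range(n)]  # stub

def ensemble_variants(brief: str, per_model: int = 10) -> list[tuple[str, str]]:
    models: dict[str, Generator] = {
        "claude": claude_variants,
        "gpt": gpt_variants,
        "gemini": gemini_variants,
    }
    return [(name, v) for name, gen in models.items() for v in gen(brief, per_model)]
```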

We'll publish the next round when we have the data.

What this means for ad operators

If you're not yet using LLMs for ad copy, the case is straightforward: at minimum, use them for variant volume. The CTR penalty (if any) is small enough that the speed lift dominates.

If you're already using LLMs for everything, the case is also straightforward: your top-performing ads probably still need a human touch. Don't fully outsource the brand-voice and conversion-critical CTA work.

LLMs are the world's best junior copywriter. The right question isn't whether to hire one — it's how to use one well.


We build AdControlCenter — AI-powered ad management for anyone running their own ads. We write what we'd want to read: real numbers, no fluff, the things we wish we'd known when we started.
