The evaluation framework that wins, not the model

Every founder I talk to is asking the same wrong question: "Which model should I use to generate ad creative?"

The model is almost irrelevant. GPT-4o, Imagen 3, Midjourney v7, Flux — at the quality floor they've all reached in 2025, the output variance between them is smaller than the variance between a good and a bad prompt. And the variance between a good prompt plus a disciplined evaluation loop and a bad prompt with no evaluation is not even close. One is a coin flip that occasionally wins. The other is a system that compounds. What separates founders who are cutting their cost-per-acquisition with AI creative from those who are burning budget on polished garbage is not which model they picked. It is whether they have a framework for deciding which output to actually use.

TL;DR — AI ad creative evaluation

Generating AI ad creative is trivially easy. Picking the right output is the hard, expensive problem.
A brand-fit score (1–5) and a conversion-likelihood score (1–5) give you two orthogonal signals that catch different failure modes.
Human votes — thumbs up/down with optional comments — are the cheapest training signal you will ever collect (see lib/creative/feedback.ts).
An LLM-powered bad-pattern extractor converts those votes into directive sentences your next generation prompt can honor verbatim (see lib/creative/bad-pattern-feedback.ts).
The feedback loop closes: bad votes → pattern extraction → prompt constraints → better output → fewer bad votes.

Why generation is easy and evaluation is the actual problem

Text-to-image and text-to-ad-copy APIs have commoditized generation. You can spin up a working creative pipeline in an afternoon. The hard problem — the one that actually costs money when you get it wrong — is evaluation at scale.

Consider what "evaluation" meant before AI. A creative director looked at six concepts, killed four, and sent two to a small test audience. Slow, but the kill rate was high because the creative director had internalized years of signal about what worked.

Now you can generate 200 variants in an hour. Your creative director did not clone themselves 200 times. The default behavior for most teams is to apply no systematic filter at all — they pick by gut, or they just run everything and let the ad platform optimize. Running everything is not a strategy. It is delegation of your job to an algorithm that charges you for every bad impression it serves while it figures out what works.

A real evaluation framework does two things: it kills bad creative before it spends money, and it captures why something was bad so future generation is cheaper and better.

Brand-fit score (1–5)

Brand-fit is the simpler of the two scores. It answers one question: does this creative represent us accurately?

A brand-fit failure is not subtle. It shows up as the wrong color palette, a tone of voice that does not match the product, imagery that implies something false about the target customer, or a headline that contradicts positioning. These failures do not need market data to detect. They need someone who knows the brand.

We score brand-fit on a 1–5 scale:

1 — Actively wrong. Wrong logo treatment, false claim, off-brand imagery that would confuse existing customers.
2 — Off. Technically correct but feels like it was made for a different company.
3 — Acceptable. Nothing wrong, nothing memorable. Would not embarrass us.
4 — On-brand. Clearly ours. Uses brand signals correctly.
5 — Distinctive. Could only be ours. Advances the brand position.

Anything scoring 1 or 2 is killed immediately, no further evaluation. A 3 goes to conversion-likelihood review. A 4 or 5 gets fast-tracked.

The practical implementation: for AI-generated creative, we pass the output through a structured prompt that evaluates against a stored brand spec — color palette, tone adjectives, prohibited imagery, positioning statement. The LLM scores it and flags specific violations. That score is logged against the creative record. When we built this, we were surprised how often a visually attractive creative scored a 2. The imagery was compelling but implied a customer persona that was wrong for the product. Beautiful, on-brand by feel, wrong by spec.
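A minimal sketch of that scoring step, in TypeScript. The BrandSpec shape, the prompt wording, and the JSON reply format are assumptions made for illustration, not our production code. The model call itself is left out so the example runs standalone.

```typescript
// Hypothetical shape for a stored brand spec; field names are illustrative.
interface BrandSpec {
  palette: string[];
  toneAdjectives: string[];
  prohibitedImagery: string[];
  positioning: string;
}

interface BrandFitResult {
  score: number; // integer from 1 to 5
  violations: string[]; // specific spec violations flagged by the model
}

// Build the structured scoring prompt from the spec and a description of the creative.
function buildBrandFitPrompt(spec: BrandSpec, creativeDescription: string): string {
  return [
    "Score this ad creative for brand fit on a 1-5 scale against the spec below.",
    `Palette: ${spec.palette.join(", ")}`,
    `Tone: ${spec.toneAdjectives.join(", ")}`,
    `Prohibited imagery: ${spec.prohibitedImagery.join(", ")}`,
    `Positioning: ${spec.positioning}`,
    `Creative: ${creativeDescription}`,
    'Reply with JSON only: {"score": <1-5>, "violations": ["..."]}',
  ].join("\n");
}

// Parse and validate the model's reply. Returns null when the reply is malformed,
// so the caller can route the creative to a human instead of trusting a bad score.
function parseBrandFitReply(reply: string): BrandFitResult | null {
  let data: unknown;
  try {
    data = JSON.parse(reply);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { score, violations } = data as { score?: unknown; violations?: unknown };
  if (typeof score !== "number" || !Number.isInteger(score) || score < 1 || score > 5) return null;
  if (!Array.isArray(violations) || !violations.every((v) => typeof v === "string")) return null;
  return { score, violations: violations as string[] };
}
```

Validating the reply matters as much as the prompt: a score that fails to parse should never be logged against the creative record as if it were a real judgment.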

Conversion-likelihood score (1–5)

Conversion-likelihood is harder because it requires learned signal, not just spec-checking. It answers: does this creative have the structural properties of things that have converted?

We are not predicting the future. We are pattern-matching against a corpus of what worked. The inputs to a conversion-likelihood score are:

Hook strength — does the first visual or first line of copy create a reason to stop scrolling? Scorable even without live data.
Claim specificity — vague claims ("better results") convert worse than specific ones ("14 days to first result"). Specific beats vague consistently in direct-response creative, and it is assessable before a creative goes live.
CTA clarity — one clear next step, not three options.
Social proof presence — where relevant to the format.
Visual hierarchy — does the eye land on the product or the claim before anything else?

Conversion-likelihood scoring on a 1–5 scale:

1 — Missing multiple fundamentals. No clear hook, no claim, no CTA.
2 — Has a CTA but the path to it is broken. Hook is weak or absent.
3 — Functional. Will convert some segment of the audience. Not wasteful to run.
4 — Strong on most fundamentals. Likely to beat the control.
5 — Everything is right. This is a test-worthy challenger.

A creative needs to score 3 or higher on both dimensions before it reaches a live test. A 4/4 gets budget. A 5/5 gets budget immediately and goes to the front of the queue.
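The thresholds above can be written down as a small gate. This is a sketch under assumptions: the decision names are hypothetical, and treating any pair at 4 or above as "budget" (for example a 4/5) is one reading of the rule, since the text only names 4/4 and 5/5 explicitly.

```typescript
type Score = 1 | 2 | 3 | 4 | 5;

// Hypothetical decision labels for illustration.
type Decision = "kill" | "live-test" | "budget" | "budget-front-of-queue";

function gateCreative(brandFit: Score, conversionLikelihood: Score): Decision {
  // Brand-fit 1 or 2 is killed immediately, before conversion review.
  if (brandFit <= 2) return "kill";
  // Both dimensions must reach 3 before a creative earns a live test.
  if (conversionLikelihood <= 2) return "kill";
  // A 5/5 gets budget and jumps the queue.
  if (brandFit === 5 && conversionLikelihood === 5) return "budget-front-of-queue";
  // Assumption: 4 or higher on both dimensions counts as "gets budget".
  if (brandFit >= 4 && conversionLikelihood >= 4) return "budget";
  return "live-test";
}
```

Checking brand-fit first mirrors the order of review: a creative that fails the spec never needs a conversion-likelihood score at all.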

The orthogonality is the point

Brand-fit and conversion-likelihood are not correlated. We regularly see 5/2s — beautiful, on-brand creative with no hook. We see 2/4s — hard-selling copy with the wrong brand voice. You need both scores because they catch different failure modes. Using one without the other leaves a whole category of failures invisible.

How to keep humans in the loop without slowing down

The phrase "human-in-the-loop" makes founders nervous because it sounds like "slower." It does not have to be.

The key insight is that humans should vote on output, not generate or rewrite it. The cognitive load of a binary vote — good or bad, optionally with a comment — is negligible. A founder or marketer can review 20 creatives in four minutes if the interface is right. That is not a bottleneck.

What kills velocity is asking humans to do generative work: "write better copy for this," "describe what's wrong," "redesign the layout." That is expensive. Thumbs up/down with an optional text comment is cheap and, at volume, extremely valuable.

Our lib/creative/feedback.ts aggregates those votes per workspace. It pulls the last 50 votes, separates good and bad prompts, and runs a keyword-frequency extraction to find patterns. The extraction is intentionally simple — it is looking for words that appear in more than 30% of bad prompts, filtering out stop words and generic terms like "professional" or "high quality" that appear in almost everything.
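A rough sketch of that kind of extraction, not the actual contents of lib/creative/feedback.ts. The Vote shape, the stop-word list, and the function name are assumptions, and the sketch assumes votes arrive ordered oldest first. It counts each word once per bad prompt and keeps words present in more than 30% of them.

```typescript
// Hypothetical vote record; the real schema may differ.
interface Vote {
  prompt: string;
  verdict: "good" | "bad";
}

// Illustrative lists only; a real stop-word list would be much longer.
const STOP_WORDS = new Set(["a", "an", "the", "of", "in", "on", "with", "and", "for", "to"]);
const GENERIC_TERMS = ["professional", "high quality"];

function extractBadKeywords(votes: Vote[], windowSize = 50, threshold = 0.3): string[] {
  // Keep only the most recent votes, then isolate the bad prompts.
  const recent = votes.slice(-windowSize);
  const badPrompts = recent.filter((v) => v.verdict === "bad").map((v) => v.prompt.toLowerCase());
  if (badPrompts.length === 0) return [];

  // Count how many bad prompts each word appears in (not total occurrences).
  const promptCount = new Map<string, number>();
  for (const prompt of badPrompts) {
    let text = prompt;
    // Strip generic phrases first so multi-word terms like "high quality" are removed whole.
    for (const term of GENERIC_TERMS) text = text.split(term).join(" ");
    const words = new Set(text.split(/[^a-z0-9]+/).filter((w) => w.length > 2 && !STOP_WORDS.has(w)));
    words.forEach((w) => promptCount.set(w, (promptCount.get(w) ?? 0) + 1));
  }

  // Keep words that appear in more than `threshold` of the bad prompts.
  const keywords: string[] = [];
  promptCount.forEach((count, word) => {
    if (count / badPrompts.length > threshold) keywords.push(word);
  });
  return keywords.sort();
}
```

With four bad prompts, a word has to appear in at least two of them to clear the threshold, which is what keeps one-off words out of the result.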

That simple extractor has a known limitation: it returns words like "corporate" or "minimal" without the directional context that makes them actionable. "Corporate" is a pattern, but what specifically about corporate imagery should be avoided? That is where the next layer comes in.

What we built

That limitation is why we built lib/creative/bad-pattern-feedback.ts on top of the vote aggregator.

The logic is straightforward. When a workspace has at least 3 BAD votes — below that threshold there is not enough signal to be useful — we pull the prompts behind those bad creatives (up to 30) and pass them to Claude Haiku with a system prompt written for an art director, not a marketer:

You are an art director reviewing failed ads. Your output is a directive list of visual patterns to avoid in the next generation. Keep it tight — 3–6 short sentences, no preamble, no markdown bullets, no headings. Each sentence stands alone as a constraint that an image-generation prompt builder can paste in verbatim.

The output looks like: "Avoid generic stock-photo lighting (bright overhead fluorescent). Avoid corporate boardroom or office settings. Avoid stylized illustration — keep everything photorealistic."

That string gets injected into the next generation prompt as avoidPatterns. The creative director's memory — the internalized sense of what does not work — is now part of the system.

The cost of this call is well under a cent per invocation at current Haiku pricing. We cache the result per workspace for batch generation runs so a loop generating 50 variants does not call the LLM 50 times for the same pattern extraction.
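The plumbing around that call can be sketched as follows. This is an illustration, not the contents of lib/creative/bad-pattern-feedback.ts: the function names are hypothetical, the model call is passed in as a parameter rather than shown, and taking the most recent 30 prompts is an assumption (the text only says "up to 30").

```typescript
const MIN_BAD_VOTES = 3; // below this, skip extraction entirely
const MAX_PROMPTS = 30; // cap on prompts sent to the model

// Returns the prompts to send to the model, or null when there is not enough signal.
function selectBadPrompts(badPrompts: string[]): string[] | null {
  if (badPrompts.length < MIN_BAD_VOTES) return null;
  // Assumption: when there are more than 30, keep the most recent ones.
  return badPrompts.slice(-MAX_PROMPTS);
}

// Wrap an async extractor so each workspace triggers at most one call per batch run.
// The promise itself is cached, so concurrent callers share a single in-flight request.
// Note: a rejected promise stays cached in this sketch; a real version would likely evict it.
function cachePerWorkspace<T>(
  extract: (workspaceId: string) => Promise<T>,
): (workspaceId: string) => Promise<T> {
  const cache = new Map<string, Promise<T>>();
  return (workspaceId) => {
    let pending = cache.get(workspaceId);
    if (pending === undefined) {
      pending = extract(workspaceId);
      cache.set(workspaceId, pending);
    }
    return pending;
  };
}

// Append the directive sentences to the generation prompt verbatim.
function withAvoidPatterns(basePrompt: string, avoidPatterns: string | null): string {
  return avoidPatterns ? `${basePrompt}\n\n${avoidPatterns}` : basePrompt;
}
```

Creating one cached extractor per batch run means a loop over 50 variants awaits the same promise 50 times but pays for the extraction once.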

The feedback loop closes

The full cycle looks like this:

Generate creative variants with current prompt.
Human reviewer votes good/bad, optionally comments.
getCreativeFeedback aggregates votes and extracts keyword patterns.
getBadPatternFeedback converts bad-vote prompts into directive sentences via Haiku.
Next generation run injects avoidPatterns into the prompt.
Brand-fit and conversion-likelihood scores are logged alongside vote outcome, creating a labeled corpus.
Over time, the labeled corpus improves the conversion-likelihood scorer's accuracy for that workspace's specific audience.

Each round of generation is cheaper and better than the last. Not because the model changed. Because the evaluation loop accumulated signal.

Why the labeled corpus matters long-term

After enough cycles, you have something genuinely rare: a labeled dataset of AI-generated ad creative tagged with real conversion outcomes and human brand judgments, specific to your product and audience. That is not a commodity you can buy. It is a structural advantage that compounds the longer you run the loop — and one that does not transfer to a competitor who simply picks a better model.

A concrete example of the loop beating a model swap

One workspace in our product sells a direct-to-consumer supplement. In their first month they generated creatives using their existing prompt, ran them without scoring, and let Meta optimize. CPA was high and the creative set had gone stale — they were seeing frequency climb and CTR fall.

Their instinct was to try a different image model. We suggested they run the evaluation framework first.

When we scored their existing corpus, every creative below 3 on brand-fit had the same failure: lifestyle imagery implying a younger demographic than their actual buyer. The bad-pattern extractor flagged "gym setting," "20s athlete," and "supplement powder close-up" as the recurring patterns in their low-performing set.

They regenerated with those patterns in the avoid list, scored the new batch, and sent only the creatives clearing 3/3 or higher to live test. What we can say without a verified number: the creative kill rate before spend rose from near-zero to more than half their generated set, which is exactly what a functioning pre-flight filter should do. Fewer impressions wasted on creative that had already failed the structural test.

The model did not change. The evaluation did.

The model question, answered honestly

We get asked which model to use. Here is our honest answer: run the cheapest model that clears your brand-fit and conversion-likelihood thresholds for your product category. Once your evaluation loop is running, you can A/B test models the same way you A/B test creative — route a percentage of generation to a new model, compare average scores and downstream conversion, make a data-driven call.
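One way to run that model test, sketched under assumptions: the arm names, the ScoredCreative shape, and the hash-based routing are all illustrative choices, not a description of our product. Routing on a hash of the request ID keeps assignment deterministic, so a retried request lands on the same model.

```typescript
type Arm = "control" | "challenger";

// FNV-1a style 32-bit hash over the string's UTF-16 code units, mapped to [0, 1).
function hashToUnit(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) / 0x100000000;
}

// Route roughly `challengerShare` of requests (0 to 1) to the new model.
function pickModel(requestId: string, challengerShare: number): Arm {
  return hashToUnit(requestId) < challengerShare ? "challenger" : "control";
}

// Hypothetical record of a scored creative and the arm that produced it.
interface ScoredCreative {
  arm: Arm;
  brandFit: number;
  conversionLikelihood: number;
}

// Average both scores for one arm; returns null when the arm has no records yet.
function averageScores(
  records: ScoredCreative[],
  arm: Arm,
): { brandFit: number; conversionLikelihood: number } | null {
  const inArm = records.filter((r) => r.arm === arm);
  if (inArm.length === 0) return null;
  const sum = inArm.reduce(
    (acc, r) => ({
      brandFit: acc.brandFit + r.brandFit,
      conversionLikelihood: acc.conversionLikelihood + r.conversionLikelihood,
    }),
    { brandFit: 0, conversionLikelihood: 0 },
  );
  return {
    brandFit: sum.brandFit / inArm.length,
    conversionLikelihood: sum.conversionLikelihood / inArm.length,
  };
}
```

Average pre-flight scores are only half of the comparison. Downstream conversion per arm still has to come from the ad platform before the call is made.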

Without the evaluation framework, a model swap is a guess. With it, a model swap is a measurement.

The founders who are winning with AI creative right now did not find a secret model. They built a system that tells them, reliably and fast, which output to trust.

FAQ

How do I evaluate AI ad creative without running live tests?

Use two pre-flight scores: brand-fit (does this represent us accurately?) and conversion-likelihood (does this have the structural properties of things that convert?). Brand-fit can be scored against a written brand spec. Conversion-likelihood scores hook strength, claim specificity, CTA clarity, and visual hierarchy — all assessable before a creative touches an audience. Only creative that clears both thresholds earns a live test.

What is the minimum number of human votes needed for a feedback loop to work?

In our implementation, we require at least 3 BAD votes before running the LLM pattern extractor. Below that, the signal is too thin to be directional. For keyword-frequency patterns, the 30% threshold only starts filtering at 4 samples: with 3 or fewer, a word that appears in a single prompt already clears 30%. In practice, a workspace generating 10 or more creatives per week accumulates enough signal within the first two to three weeks.

How do I score brand-fit for AI-generated ads at scale?

Build a structured brand spec — color palette, tone adjectives, prohibited imagery, positioning statement — and pass it alongside the creative output to an LLM that scores and flags violations. This takes the scoring out of the reviewer's working memory and makes it consistent across reviewers and over time. Human review then focuses on edge cases the automated score flags as uncertain.

Will switching to a better AI model improve my ad performance?

Marginally, and only if your evaluation framework is already working. Without a reliable way to identify which output is good, a better model just gives you better-looking bad creative. The evaluation loop is the multiplier. The model is the input.

What makes a conversion-likelihood score different from just running the ad and seeing what happens?

Speed and cost. A conversion-likelihood score is a pre-flight filter that kills obvious failures before they spend money. It is not a substitute for live testing — it is what you do before live testing to make sure the test set is worth running. The score improves over time as you build a labeled corpus of what your specific audience responded to.

How do you prevent the feedback loop from narrowing creative too aggressively?

This is a real risk. If every generation round avoids more patterns, you eventually generate creative that is technically safe but completely undifferentiated. We handle this by keeping the avoidPatterns directive focused strictly on visual failures — composition, lighting, subject type — not on style or concept breadth. We also periodically reset the avoid list and generate a "wild" batch to probe for new winning concepts outside the current template.

Can I use this framework across multiple ad platforms — Meta, Google, TikTok — simultaneously?

Yes, but score separately by placement. A conversion-likelihood score for a static Meta feed ad does not transfer to a TikTok video. Hook mechanics, visual hierarchy, and CTA placement are format-specific. We scope vote feedback and pattern extraction to workspace by default, but adding a placement dimension to the scoring rubric makes the signal sharper. If your volume allows it, maintain separate labeled corpora per platform.

The specific question to ask yourself this week: do you have a written brand spec that an LLM could score against, or is "brand-fit" still a feeling? If it is still a feeling, that is where to start — not with a new model.
