
We tested 4 image models on the same ad — here's what won

Flux Pro, Imagen 3, Grok, and Nano Banana Pro on the same brief. The winner wasn't the one we expected, and the loser was a surprise too.

AdControlCenter Team

We picked one ad brief — a calm editorial shot of a hospitality dashboard for a boutique hotel campaign — and ran it through four image models with as close to identical prompts as each model would accept. Here's what came back, what we'd ship, and the dimensions where each model genuinely differed.

The brief

"Editorial photograph of a tablet displaying a hotel-management interface on a stone counter beside a coffee cup, soft morning window light from the left, shallow depth of field, warm muted palette, no readable text on screen, magazine-quality composition." Aspect 16:9.

Methodology

We used the same prompt wherever a model accepted it verbatim. For models that needed slight syntax tweaks (Imagen 3 wants more verbose descriptions; Grok prefers shorter ones), we adapted minimally without changing intent.
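One low-effort way to keep such tweaks honest is to store every model's prompt next to the base brief. Here's a minimal sketch of that setup; the model keys and the per-model phrasings are illustrative stand-ins, not our exact production strings — only the base prompt is the actual brief:

```python
# Keep per-model prompt variants in one place so tweaks stay visible and minimal.
# BASE_PROMPT is the real brief; the per-model adjustments below are paraphrased
# illustrations, not the exact strings we sent.

BASE_PROMPT = (
    "Editorial photograph of a tablet displaying a hotel-management "
    "interface on a stone counter beside a coffee cup, soft morning "
    "window light from the left, shallow depth of field, warm muted "
    "palette, no readable text on screen, magazine-quality composition."
)

PROMPTS = {
    "flux-pro": BASE_PROMPT,
    # Imagen 3 responded better to a more verbose scene description (illustrative):
    "imagen-3": BASE_PROMPT + " The scene is calm, unstaged, and naturally lit.",
    # Grok preferred a terser phrasing (illustrative):
    "grok": (
        "Editorial photo: tablet with a blurred hotel dashboard on a stone "
        "counter, coffee cup, soft left window light, warm muted palette, 16:9."
    ),
    "nano-banana-pro": BASE_PROMPT,
}
```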

Each model produced 4 variants from the prompt. We scored each variant on 5 dimensions, 1–5 scale: photorealism, composition, palette match, prompt fidelity (did it actually do what we asked), and ad-suitability (would we ship it).

Total: 16 variants, 80 scores.
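If you want to reproduce the bookkeeping, a minimal sketch of the scoring pipeline follows, assuming one row per (model, variant, dimension); the two sample rows are placeholders showing the shape, not our data:

```python
from collections import defaultdict

# Five dimensions, scored 1-5 per variant.
DIMENSIONS = ["photorealism", "composition", "palette", "fidelity", "ad_fit"]

# One row per (model, variant, dimension): 4 models x 4 variants x 5 dims = 80 scores.
scores = [
    ("flux-pro", 1, "photorealism", 4.5),  # placeholder row
    ("flux-pro", 1, "composition", 4.0),   # placeholder row
    # ... 78 more rows
]

def per_dimension_averages(rows):
    """Average each model's variant scores per dimension (what the table later in the post reports)."""
    buckets = defaultdict(list)
    for model, _variant, dimension, score in rows:
        buckets[(model, dimension)].append(score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}
```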

[Image: editorial photograph of a 2x2 grid of small printed image proofs]

What each model actually produced

Flux Pro. The most "polished" output. Composition was clean, light direction was correct, palette landed. The tablet itself looked plausible. The screen content was nondescript color blocks, which is exactly what we wanted (we asked for no readable text). Photorealism: 4.5. Ad-suitability: 4. Best of the four for a default shot.

Imagen 3. Similar quality to Flux on photorealism, but it kept trying to render fake interface text on the tablet despite the negative prompt. Three of four variants had garbled UI elements that would not survive review. The one clean variant was actually the best of the entire test on composition. Inconsistent.

Grok. Faster than the others (~12s vs ~25s). Photorealism slightly weaker — the lighting felt synthetic, the surfaces had a too-clean quality. But it nailed prompt fidelity: zero variants had hallucinated text, and all four matched the requested palette. The trade-off: faster and more obedient, but visually less convincing than Flux.

Nano Banana Pro. This was the surprise. We expected it to underperform Flux on the editorial-photo brief because Nano Banana is built for Gemini-style multimodal use, not for high-end advertising visuals. Instead it came in just behind Flux on photorealism (4 vs. 4.5) and ahead on prompt fidelity. The variants felt "designed" rather than photographed — but for ad work, that's often a feature.

Per-dimension scores

| Model | Photorealism | Composition | Palette | Fidelity | Ad-fit |
| --- | --- | --- | --- | --- | --- |
| Flux Pro | 4.5 | 4 | 4 | 3.5 | 4 |
| Imagen 3 | 4.5 | 4 | 3.5 | 2.5 | 3 |
| Grok | 3.5 | 3.5 | 4.5 | 5 | 3.5 |
| Nano Banana Pro | 4 | 4.5 | 4.5 | 5 | 4.5 |

Numbers are the average across each model's 4 variants.

What we'd ship

Nano Banana Pro for this brief. The combination of high prompt fidelity, clean composition, and consistent palette makes it the safer default. Flux Pro is a strong second — and would be our pick if the brief required maximum photorealism (e.g., an outdoor lifestyle shot where the synthetic feel of Grok would show).

For a different brief — say, a moody product shot with no text or interface — the rankings would probably reverse: Flux's photorealistic edge would matter more, and Nano Banana's "designed" feel would become a liability.

What this means

A few takeaways from the test that generalize:

  1. There's no single winning model. Match the model to the brief. We've moved away from "Flux is our default" to "we ask which model fits this specific shot" (a toy routing sketch follows this list).
  2. Fidelity beats finesse. Models that obediently respect negative prompts save more time in post than models that produce slightly prettier base images.
  3. Test sample sizes are dangerous. Four variants per model isn't enough to make hard claims. We're treating these scores as directional, not gospel.
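As a concrete illustration of takeaway 1, a brief-to-model router can be very small. The rules below are hypothetical defaults distilled from this single test — directional, not a shipped policy:

```python
def pick_model(brief: dict) -> str:
    """Hypothetical routing rules distilled from one test; treat as directional."""
    if brief.get("has_ui_on_screen"):
        # Text/UI hallucination was the main failure mode; fidelity wins here.
        return "nano-banana-pro"
    if brief.get("needs_max_photorealism"):
        # e.g. an outdoor lifestyle shot
        return "flux-pro"
    if brief.get("latency_sensitive"):
        # ~12s per variant vs ~25s for the others in our run
        return "grok"
    return "flux-pro"  # safe general default

# The brief from this post routes to the model we'd ship:
print(pick_model({"has_ui_on_screen": True}))  # nano-banana-pro
```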

[Image: editorial photograph of a hand holding a single selected printed image proof]

Caveats

This was one brief and one shot. The rankings on a portrait, a product shot, an abstract piece, or a creative-illustration brief could each come out differently. We're running the same protocol on three more brief types in May; we'll publish results when we have them.

The other caveat: prompt phrasing matters more than model choice for most outcomes. We've seen the same model produce both shippable and unusable variants depending on how the prompt was phrased. Before switching models, try iterating the prompt on your current one; the gain there is usually larger than the gain from changing models.

> "Models are commodities. Prompts are the moat."

What we'd test next

If we ran this again with more time:

  • Same brief, 16 variants per model (instead of 4) to get a tighter score distribution
  • Three brief types — editorial photo, product shot, abstract illustration
  • Time-per-variant cost analysis (Grok wins this badly, but does the speed advantage matter at our scale?)
  • Cost-per-shippable-variant — total spend divided by variants we'd actually use, which is the only metric that pays the bills (sketched after this list)
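That last metric is a one-line division. A minimal sketch, assuming you track total spend and a per-variant shippable flag (both names here are illustrative):

```python
def cost_per_shippable(total_spend_usd: float, variants: list[dict]) -> float:
    """Total spend divided by the variants you'd actually use."""
    shippable = sum(1 for v in variants if v["shippable"])  # illustrative flag
    return total_spend_usd / shippable if shippable else float("inf")

# Example: $12 of generations, 16 variants, 5 usable -> $2.40 per shippable variant
print(cost_per_shippable(12.0, [{"shippable": i < 5} for i in range(16)]))
```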

We'll write that test up next quarter. If you've run a similar test on a different brief, we'd love to see your data.
