We tested 4 image models on the same ad — here's what won
Flux Pro, Imagen 3, Grok, and Nano Banana Pro on the same brief. The winner wasn't the one we expected, and the loser was a surprise too.


We picked one ad brief — a calm editorial shot of a hospitality dashboard for a boutique hotel campaign — and ran it through four image models with as close to identical prompts as each model would accept. Here's what came back, what we'd ship, and the dimensions where each model genuinely differed.
The prompt: "Editorial photograph of a tablet displaying a hotel-management interface on a stone counter beside a coffee cup, soft morning window light from the left, shallow depth of field, warm muted palette, no readable text on screen, magazine-quality composition." Aspect ratio: 16:9.
Methodology
Same prompt where the model accepted it. For models that needed slight syntax tweaks (Imagen 3 wants more verbose descriptions, Grok prefers shorter ones), we adapted minimally without changing intent.
Each model produced 4 variants from the prompt. We scored each variant on 5 dimensions, 1–5 scale: photorealism, composition, palette match, prompt fidelity (did it actually do what we asked), and ad-suitability (would we ship it).
Total: 16 variants, 80 scores.

What each model actually produced
Flux Pro. The most "polished" output. Composition was clean, light direction was correct, palette landed. The tablet itself looked plausible. The screen content was nondescript color blocks, which is exactly what we wanted (we asked for no readable text). Photorealism: 4.5. Ad-suitability: 4. Best of the four for a default shot.
Imagen 3. Similar quality to Flux on photorealism, but it kept trying to render fake interface text on the tablet despite the negative prompt. Three of four variants had garbled UI elements that would not survive review. The one clean variant was actually the best of the entire test on composition. Inconsistent.
Grok. Faster than the others (~12s vs ~25s). Photorealism slightly weaker — the lighting felt synthetic, the surfaces had a too-clean quality. But it nailed prompt fidelity: zero variants had hallucinated text, and all four matched the requested palette. The trade-off: faster and more obedient, but visually less convincing than Flux.
Nano Banana Pro. This was the surprise. We expected it to underperform Flux on the editorial-photo brief because Nano Banana is built for Gemini-style multimodal use, not for high-end advertising visuals. It came in level with Flux on photorealism and ahead on prompt fidelity. The variants felt "designed" rather than photographed — but for ad work, that's often a feature.
Per-dimension scores
| Model | Photorealism | Composition | Palette | Fidelity | Ad-fit |
|---|---|---|---|---|---|
| Flux Pro | 4.5 | 4 | 4 | 3.5 | 4 |
| Imagen 3 | 4.5 | 4 | 3.5 | 2.5 | 3 |
| Grok | 3.5 | 3.5 | 4.5 | 5 | 3.5 |
| Nano Banana Pro | 4 | 4.5 | 4.5 | 5 | 4.5 |
Numbers are the average across each model's 4 variants.
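The aggregation behind the table is simple enough to sketch. The variant scores below are illustrative placeholders (we're not publishing all 80 raw numbers here), chosen so the averages happen to reproduce Flux Pro's row:

```python
from statistics import mean

DIMENSIONS = ["photorealism", "composition", "palette", "fidelity", "ad_fit"]

# Hypothetical raw scores: 4 variants per model, each scored 1-5
# on five dimensions. Illustrative numbers, not the actual test data.
raw_scores = {
    "Flux Pro": [
        {"photorealism": 5, "composition": 4, "palette": 4, "fidelity": 3, "ad_fit": 4},
        {"photorealism": 4, "composition": 4, "palette": 4, "fidelity": 4, "ad_fit": 4},
        {"photorealism": 5, "composition": 4, "palette": 4, "fidelity": 3, "ad_fit": 4},
        {"photorealism": 4, "composition": 4, "palette": 4, "fidelity": 4, "ad_fit": 4},
    ],
}

def per_dimension_averages(variants):
    """Average each dimension across a model's variants."""
    return {d: mean(v[d] for v in variants) for d in DIMENSIONS}

for model, variants in raw_scores.items():
    print(model, per_dimension_averages(variants))
```

With these placeholder scores, the averages come out to 4.5 photorealism and 3.5 fidelity, matching Flux Pro's row above.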
Prompt fidelity matters more than raw photorealism for ad work. A perfectly photorealistic image with garbled UI text is a useless ad. A slightly less photorealistic image that does exactly what you asked is shippable today.
What we'd ship
Nano Banana Pro for this brief. The combination of high prompt fidelity, clean composition, and consistent palette makes it the safer default. Flux Pro is a strong second — and would be our pick if the brief required maximum photorealism (e.g., an outdoor lifestyle shot where the synthetic feel of Grok would show).
For a different brief — say, a moody product shot with no text or interface — the rankings would probably reverse: Flux's photorealistic edge would matter more, and Nano Banana's "designed" feel would become a liability.
What this means
A few takeaways from the test that generalize:
- There's no single winning model. Match the model to the brief. We've moved away from "Flux is our default" to "we ask which model fits this specific shot."
- Fidelity beats finesse. Models that obediently respect negative prompts save more time in post than models that produce slightly prettier base images.
- Test sample sizes are dangerous. Four variants per model isn't enough to make hard claims. We're treating these scores as directional, not gospel.
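The sample-size point can be made concrete. With only four variants per model, the uncertainty on each average is wide. A rough sketch, using an illustrative set of four scores for one model on one dimension (not our actual data):

```python
from math import sqrt
from statistics import mean, stdev

# Illustrative scores for one model on one dimension (not actual data).
scores = [4, 5, 4, 5]  # n = 4 variants

n = len(scores)
avg = mean(scores)            # 4.5
se = stdev(scores) / sqrt(n)  # standard error of the mean

# Rough 95% interval: t ≈ 3.18 for df = 3. The half-width comes out
# near ±0.9 on a 1-5 scale -- wide enough to swallow most of the
# gaps between models in the table above.
half_width = 3.18 * se
print(f"mean {avg} ± {half_width:.2f}")
```

That ±0.9 swing is why we treat half-point differences between models as noise, not signal.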

Caveats
This was one brief. One shot. The rankings on a portrait, a product shot, an abstract, or a creative-illustration brief could each come out differently. We're running the same protocol on three more brief types in May; we'll publish results when we have them.
The other caveat: prompt phrasing matters more than model choice for most outcomes. We've seen the same model produce shippable and unusable variants depending on how the prompt was phrased. Before switching models, try iterating the prompt on your current one — the lift is usually larger than the lift from changing the model.
Models are commodities. Prompts are the moat.
What we'd test next
If we ran this again with more time:
- Same brief, 16 variants per model (instead of 4) to get a tighter score distribution
- Three brief types — editorial photo, product shot, abstract illustration
- Time-per-variant cost analysis (Grok wins this by a wide margin, but does the speed advantage matter at our scale?)
- Cost-per-shippable-variant — total spend divided by variants we'd actually use, which is the only metric that pays the bills
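The cost-per-shippable-variant metric is worth writing down, because it's where a cheap model can quietly lose. All prices and counts below are illustrative placeholders, not measured data:

```python
# Cost-per-shippable-variant: total generation spend divided by the
# variants that actually clear review. Illustrative numbers only.

def cost_per_shippable(price_per_variant: float, variants: int, shippable: int) -> float:
    """Total spend over usable output; infinite if nothing shipped."""
    if shippable == 0:
        return float("inf")  # no usable output: the run cost you everything
    return (price_per_variant * variants) / shippable

# A cheap, less obedient model can still lose to a pricier, more obedient one.
cheap = cost_per_shippable(price_per_variant=0.02, variants=4, shippable=1)   # 0.08
pricey = cost_per_shippable(price_per_variant=0.05, variants=4, shippable=3)  # ~0.067
print(cheap, pricey)
```

In this toy example the model that costs 2.5x more per image is cheaper per shippable image, which is the only denominator that matters.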
We'll write that test up next quarter. If you've run a similar test on a different brief, we'd love to see your data.

We build AdControlCenter — AI-powered ad management for anyone running their own ads. We write what we'd want to read: real numbers, no fluff, the things we wish we'd known when we started.