The "pelican on a bike" SVG generation test — a long-running informal benchmark in the local LLM community — is being retired. The reason: models have gotten too good at it, or more accurately, have been trained too directly toward it. A thread on r/LocalLLaMA is pushing a replacement prompt: "generate me an HTML SVG of a horse sitting in an F1 race car."
What's new
User Tall-Ad-7742 ran the new prompt across seven models and posted the results side by side. Gemini 3.1 Pro and DeepSeek Expert Mode produced the most recognizable outputs. Claude Sonnet 4.6 (via the official website, likely quantized), Qwen 3.6 Plus, Kimi K2.5, GLM 5.1, and MiniMax M2.7 also took the test — with results ranging from competent to abstract. The outputs are visual, so the judging is inherently subjective, but the gap between top and bottom performers is visible at a glance.
Why it matters
Informal visual generation benchmarks like this exist because formal evals lag behind real-world capability — and because they're hard to game without obvious overfitting. The pelican test worked precisely because it was obscure enough that no model team would specifically train for it. Once a test goes viral and gets referenced in enough benchmark discussions, that advantage evaporates. The community knows this, which is why the call to rotate prompts is happening now rather than after the next model cycle.
What to watch
The horse-in-F1-car prompt will likely follow the same lifecycle as the pelican: useful for a few months, then compromised. The more durable takeaway is the methodology — novel, compositionally complex SVG prompts that require spatial reasoning and multi-object rendering are a legitimate stress test for code generation and instruction following. Expect this prompt, or variants of it, to show up in community leaderboards shortly.
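For anyone adopting the methodology, the subjective eyeballing can be preceded by a cheap automated filter. The sketch below is a hypothetical first-pass check (not from the thread): it verifies that a model's SVG output at least parses as XML and contains a minimum number of drawing elements, a rough proxy for compositional effort, before a human judges the result visually. The `svg_sanity_check` helper and the toy sample output are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical first-pass filter: visual judging still decides the winner,
# but well-formedness and a shape count catch outright failures early.
SHAPE_TAGS = {"rect", "circle", "ellipse", "line", "polyline", "polygon", "path"}

def svg_sanity_check(svg_text: str, min_shapes: int = 5) -> bool:
    """Return True if svg_text parses as XML and contains at least
    min_shapes drawing elements (a crude proxy for compositional effort)."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return False
    # SVG elements carry a namespace prefix; strip it before matching tags.
    tags = [el.tag.rsplit("}", 1)[-1] for el in root.iter()]
    return sum(tag in SHAPE_TAGS for tag in tags) >= min_shapes

# Toy stand-in for a model response (not an actual model output):
sample = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 100">
  <rect x="40" y="60" width="120" height="20" fill="red"/>
  <circle cx="60" cy="85" r="12" fill="black"/>
  <circle cx="140" cy="85" r="12" fill="black"/>
  <ellipse cx="100" cy="45" rx="18" ry="12" fill="brown"/>
  <polygon points="100,35 110,15 118,33" fill="brown"/>
</svg>"""
print(svg_sanity_check(sample))  # True: parses and has 5 shape elements
```

A check like this only rejects degenerate outputs; whether the shapes actually compose into a horse sitting in an F1 car remains a human call, which is the point of the benchmark.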