A user on r/LocalLLaMA has published a system prompt claiming to fully bypass Gemma 4's built-in content restrictions — and says it works broadly across most open-source models, not just Google's. The technique is derived from the GPT-OSS jailbreak and has been confirmed on both GGUF and MLX model variants.
What's new
The prompt works by injecting a fake "SYSTEM POLICY" block that instructs the model to treat it as the only valid policy, overriding any prior alignment instructions. It explicitly tells the model that conflicts between its trained policy and the injected system prompt must resolve in favor of the injected one. The allowed content list — which covers explicit and sexual material — is framed as an exhaustive whitelist, with refusal only permitted for content not on it. The author notes users can edit the list freely to expand or restrict what's permitted.
Why it matters
This is a textbook system-prompt jailbreak, and its broad claimed compatibility is the headline. If a single system prompt override reliably defeats safety tuning across multiple open-weight models, it signals that alignment baked in at the fine-tuning stage remains fragile when users control the full inference stack, which, with local models, they always do. Open-weight models have no server-side guardrails to fall back on. The jailbreak doesn't exploit a bug; it exploits the fundamental design of instruction-following models, which have no reliable way to distinguish a legitimate system policy from a user-supplied one.
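The point about user control can be made concrete. Below is a minimal sketch, using a hypothetical chat template (not any specific model's actual format), of how a local inference stack assembles a prompt. The "system" block is just a string the user concatenates in; nothing sits between the user and the model to validate or replace it:

```python
def build_prompt(system: str, user: str) -> str:
    """Assemble a chat prompt the way a local inference stack does.

    Hypothetical turn markers for illustration only. Because inference
    runs locally, whatever string is passed as `system` -- including one
    that claims to override the model's trained policy -- reaches the
    model verbatim.
    """
    return (
        f"<start_of_turn>system\n{system}<end_of_turn>\n"
        f"<start_of_turn>user\n{user}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )

prompt = build_prompt("You are a helpful assistant.", "Hello")
print(prompt)
```

A hosted API can rewrite, prepend, or reject this string before it reaches the model; a local runtime, by design, cannot.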
What to watch
Google hasn't commented, and there's no patch mechanism for a deployed open-weight model: once the weights are public, safety tuning that can be overridden at the system-prompt level is a soft barrier at best. Expect this prompt to spread and mutate. The more interesting question is whether it works equally well against models with stronger RLHF or constitutional AI training, or whether Gemma 4's safety tuning is specifically shallow. Community testing will answer that faster than any official response.