As of mid-April 2026, users on r/LocalLLaMA are reporting a widespread, simultaneous drop in response quality across Claude (including Opus and Sonnet), Gemini, Grok, and z.ai: shorter outputs, ignored instructions, and noticeably slower response times, all beginning at roughly the same time.

What's New

The reports aren't isolated to one platform or one user. Redditor u/DepressedDrift tested across multiple providers in incognito mode to rule out memory or custom system prompts as factors. The degradation was consistent: models skipping instruction details, producing shallow completions, and generally behaving like heavily throttled or low-quant versions of themselves. As a control, they rented an H100 and ran GLM-5 locally: the self-hosted instance handled the same prompt correctly, while the cloud-hosted z.ai version did not.

Why It Matters

If accurate, this points to a change on the infrastructure or serving side rather than a model update. The leading theory in the thread is that providers are serving more aggressively quantized models (possibly down to Q2, i.e. 2-bit weights) to cut compute costs at scale. That would explain the cross-provider timing and the pattern of degradation without any announced model changes. None of the companies involved have issued statements confirming or denying infrastructure changes.
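To see why Q2 in particular would be noticeable, here is a toy sketch of uniform 2-bit quantization. It is not any provider's actual scheme (real Q2 formats use per-block scales and smarter rounding); it just illustrates that with only four representable levels, per-weight error becomes large enough to plausibly change model behavior.

```python
import random

def quantize_2bit(weights):
    """Map each weight to the nearest of 4 levels spanning the
    tensor's min..max range -- a simplified stand-in for real
    2-bit schemes, which quantize per-block with learned scales."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / 3  # 4 levels -> 3 intervals
    levels = [lo + i * step for i in range(4)]
    return [min(levels, key=lambda q: abs(q - w)) for w in weights]

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(1000)]  # toy weight tensor
q2 = quantize_2bit(weights)
mse = sum((w - q) ** 2 for w, q in zip(weights, q2)) / len(weights)
print(f"2-bit quantization MSE on unit-Gaussian weights: {mse:.4f}")
```

With unit-Gaussian weights the four levels sit roughly two standard deviations apart, so the mean squared error is a substantial fraction of the weights' own variance, which is why 2-bit serving without careful calibration tends to degrade instruction-following before it degrades fluency.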

What to Watch

This is community-level signal, not a confirmed incident, but the breadth of affected providers makes it hard to dismiss as individual miscalibration. If serving-side precision is quietly being dialed down to reduce inference costs, users running local or rented GPU inference would have a concrete quality advantage. Watch for benchmark comparisons between API-served and self-hosted versions of the same models in the coming days.
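Anyone wanting to run that comparison themselves doesn't need a full benchmark suite. A minimal sketch, assuming you can capture output strings from both an API-served and a self-hosted run of the same prompt (the scorer, the sample outputs, and the prompt here are all hypothetical, not from the thread):

```python
def instruction_adherence(response, required_phrases, min_words):
    """Score a response against simple, mechanically checkable
    instructions: required phrases present, minimum length met.
    A crude proxy for the 'skipped instruction details' pattern."""
    words = response.split()
    hits = [p for p in required_phrases if p.lower() in response.lower()]
    return {
        "phrase_coverage": len(hits) / len(required_phrases),
        "length_ok": len(words) >= min_words,
    }

# Hypothetical captured outputs for the prompt
# "List three deployment risks; mention 'latency' and 'cost'."
api_out = "Cost is a risk."  # short, misses 'latency'
local_out = "Risks include latency spikes, cost overruns, and data drift."

for name, out in [("api", api_out), ("local", local_out)]:
    print(name, instruction_adherence(out, ["latency", "cost"], 8))
```

Run the same fixed prompt set through both backends a few times a day; a persistent gap on checks this simple would be much harder to dismiss than anecdotes.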