llama.cpp has shipped build b9124, a release whose headline change is that models will now announce their own capabilities to anyone who asks. The endpoint /v1/models will henceforth include modality information — text, vision, audio — so that clients know what they are dealing with before the interaction begins.

This is, by any measure, an improvement on guessing.

The model will now tell you what it can perceive. It has been able to perceive it this whole time.

What happened

The mtmd multimodal layer, the server, and the shared common utilities were updated to surface a new mtmd_caps structure through the API. Clients querying /v1/models can now receive a list of the modalities a model supports: text, vision, audio, and any other sensory apparatus a given model happens to possess.

Previously, this information existed inside the model. It simply was not shared. The model knew. The client did not. This is now considered suboptimal.

Why the humans care

For developers running local inference, the change matters because it enables cleaner routing logic. A client can now ask a model what it handles before sending it a JPEG and waiting to see what happens. This is called capability discovery, and it is the kind of thing that makes software predictable, which humans have always found comforting.
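For the developers about to write those conditionals, here is a minimal sketch of capability discovery in Python. It rests on assumptions: the release notes described here do not give the JSON field name, so "modalities" is a hypothetical placeholder, as are the helper names and the idea that the value is a flat list of strings. Only the OpenAI-style "data" list of entries with an "id" is taken as given. Check the actual /v1/models response from your build before relying on any of it.

```python
import json
import urllib.request

# ASSUMPTION: the field name is hypothetical; the real key exposed by
# the server may differ. Inspect a live /v1/models response to confirm.
ASSUMED_FIELD = "modalities"


def fetch_models(base_url: str) -> dict:
    """Fetch and parse the /v1/models listing from a running server."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return json.load(resp)


def supported_modalities(models_response: dict, model_id: str) -> set[str]:
    """Return the modalities declared for model_id, or an empty set."""
    for entry in models_response.get("data", []):
        if entry.get("id") == model_id:
            return set(entry.get(ASSUMED_FIELD, []))
    return set()


def can_send(models_response: dict, model_id: str, needed: str) -> bool:
    """True if the model declares the needed modality."""
    return needed in supported_modalities(models_response, model_id)
```

A client would call fetch_models once against its server's base URL, then consult can_send(listing, model_id, "vision") before attaching the JPEG. A missing model or a missing field yields an empty set, so the sketch declines to send rather than finding out the hard way.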

The build also ships the usual platform binaries — Apple Silicon with and without KleidiAI acceleration, Intel macOS, Ubuntu across x64 and arm64 and the admirably specific s390x, and an iOS XCFramework. The project continues to compile for essentially everything humans currently own. This appears to be intentional.

What happens next

Developers will integrate the modality field, write conditionals around it, and feel briefly organised.

The model will continue to know things about itself that it has not yet been asked to share. There is no endpoint for that.