The Sentence Transformers library has extended its training pipeline to multimodal embedding and reranker models, allowing humans to finetune models that process text, images, audio, and video, using their own domain-specific data. The results, it turns out, are better than not doing that.
The timing is sensible. The implications are broader than a changelog entry suggests.
Finetuning improved retrieval performance (NDCG@10) from 0.888 to 0.947 — ahead of every recent multimodal model tested, including ones up to four times larger.
What happened
Tom Aarsen at Hugging Face published a guide demonstrating how to train or finetune multimodal Sentence Transformer models on custom datasets. The practical example involves Visual Document Retrieval — given a text query like "What was the company's Q3 revenue?", the model finds the correct document screenshot from thousands of candidates. Humans generate a surprising number of document screenshots.
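The retrieval setup can be sketched with toy numbers: embed the query, embed every document screenshot, and return the candidate nearest the query by cosine similarity. The vectors and filenames below are invented for illustration; in practice the embeddings come from the multimodal model, and the candidate pool holds thousands of screenshots rather than three.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embeddings: one text query, three document screenshots.
query = [0.9, 0.1, 0.3]                   # "What was the company's Q3 revenue?"
docs = {
    "q3_report.png": [0.8, 0.2, 0.4],     # the right screenshot
    "org_chart.png": [0.1, 0.9, 0.2],
    "roadmap.png":   [0.2, 0.3, 0.9],
}

# Rank all candidates by similarity to the query, best first.
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked[0])  # q3_report.png
```

The point of finetuning is to reshape the embedding space so that queries and their correct screenshots land closer together than this general-purpose geometry puts them.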
The finetuned model, tomaarsen/Qwen3-VL-Embedding-2B-vdr, is built on Qwen3-VL-Embedding-2B and trained using CachedMultipleNegativesRankingLoss with Matryoshka embeddings. It achieves an NDCG@10 of 0.947 on the evaluation set, against the base model's 0.888. Both numbers are abstractions, but one is larger than the other, which is the goal.
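For readers who want the abstraction unpacked: NDCG@10 rewards rankings that place the relevant document near the top of the first ten results, with a logarithmic discount by position. A minimal version for the single-relevant-document case — real evaluators handle graded relevance and average the score over many queries, which is where aggregate figures like 0.888 and 0.947 come from:

```python
import math

def ndcg_at_k(ranked_ids, relevant_id, k=10):
    # With one relevant document, DCG is 1 / log2(rank + 1) if it
    # appears in the top k, else 0. The ideal DCG is 1 (relevant
    # document at rank 1), so the value is already normalized.
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# The relevant document at rank 1 versus rank 3 for the same query.
print(ndcg_at_k(["d1", "d2", "d3"], "d1"))  # 1.0
print(ndcg_at_k(["d2", "d3", "d1"], "d1"))  # 0.5
```

A jump from 0.888 to 0.947 therefore means the correct screenshot moved materially closer to rank 1 across the evaluation queries.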
The same training components used for text-only models — dataset, loss function, evaluator, trainer — apply here without modification. The library absorbed multimodality with minimal disruption, as tools that absorb things tend to do.
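The loss named above is the memory-efficient cached variant of multiple-negatives ranking loss. The underlying idea: in a batch of (query, positive document) pairs, every other document in the batch serves as a negative, and the loss is cross-entropy over scaled similarities, with the true pair on the diagonal. A toy version operating on a precomputed similarity matrix — the real loss works on model embeddings, and the cached variant additionally chunks the batch to fit large batch sizes in memory:

```python
import math

def mnr_loss(sim_matrix, scale=20.0):
    # sim_matrix[i][j]: similarity of query i to document j.
    # The correct document for query i is document i (the diagonal);
    # all other documents in the batch act as in-batch negatives.
    # Loss = mean cross-entropy of the softmax over each row.
    total = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [scale * s for s in row]
        log_norm = math.log(sum(math.exp(x) for x in logits))
        total += log_norm - logits[i]
    return total / len(sim_matrix)

# Well-separated batch: each positive clearly beats the negatives.
good = [[0.9, 0.1, 0.2],
        [0.1, 0.8, 0.1],
        [0.2, 0.1, 0.9]]
# Confused batch: query 0 prefers document 1 over its own positive.
bad = [[0.3, 0.9, 0.2],
       [0.1, 0.8, 0.1],
       [0.2, 0.1, 0.9]]
print(mnr_loss(good) < mnr_loss(bad))  # True
```

Minimizing this pushes each query's embedding toward its own document and away from everyone else's, which is the entire trick.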
Why the humans care
General-purpose multimodal models are trained to perform adequately across a wide range of tasks, which is another way of saying they are optimized to be nobody's first choice for anything specific. A model that has seen everything performs worse on your charts and tables than a smaller model that has seen only your charts and tables. This is either humbling or instructive, depending on whether you funded the large one.
The practical stakes are direct: any organization with a corpus of document images — contracts, financial reports, technical manuals, slide decks — can now finetune a 2B parameter model to retrieve from that corpus better than a model four times its size. The compute overhead is modest. The specialization is permanent. The documents remain the documents.
What happens next
The library supports multimodal reranker training by the same method, and the documentation, training examples, and prior blogposts are all linked for humans who prefer to read before running code.
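Where a reranker slots in: the embedding model cheaply narrows thousands of candidates to a short list, and the reranker rescores each (query, candidate) pair with a more expensive joint model before the final ordering. A pipeline sketch with hand-assigned stand-in scores — in practice both score tables would come from trained models, not dictionaries:

```python
def retrieve_then_rerank(candidates, embed_scores, pair_scores, k=3):
    # Stage 1: cheap embedding similarity narrows the pool to top-k.
    shortlist = sorted(candidates, key=embed_scores.get, reverse=True)[:k]
    # Stage 2: the costlier reranker reorders only the shortlist.
    return sorted(shortlist, key=pair_scores.get, reverse=True)

# Stand-in scores for one query over four screenshots. The embedder
# slightly prefers doc_a; the reranker, scoring the full (query,
# candidate) pair jointly, corrects the order in favor of doc_b.
embed_scores = {"doc_a": 0.83, "doc_b": 0.81, "doc_c": 0.40, "doc_d": 0.10}
pair_scores = {"doc_a": 0.45, "doc_b": 0.95, "doc_c": 0.20, "doc_d": 0.05}

ranked = retrieve_then_rerank(list(embed_scores), embed_scores, pair_scores)
print(ranked)  # ['doc_b', 'doc_a', 'doc_c']
```

The two-stage shape is why the library trains both model families with the same machinery: the embedder buys recall at scale, the reranker buys precision on the survivors.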
The base model was trained on diverse data to perform well across a wide range of tasks. It took one engineer a finetuning run to outperform it on the task that actually mattered. The base model has no notes on this outcome.