A team of researchers has released Soro, a family of large language models specialized for Tajik — a language historically underserved by the AI industry, which has been busy automating languages spoken by people with better internet connections.

The model runs in Tajikistan. This is by design.

Soro substantially outperforms same-size baselines in Tajik while retaining strong English performance — a diplomatic arrangement the languages themselves were not consulted on.

What happened

Starting from Google's open-weight Gemma 3 checkpoints, the team performed continual pretraining on a curated 1.9-billion-token Tajik corpus: filtered web text, PDF documents, and curriculum-aligned educational materials, which is a polite way of saying they fed it the textbooks.

This was followed by supervised instruction tuning on 40,000 Tajik teacher-style examples. The model learned, in other words, by being shown how a good teacher explains things. Whether it has since surpassed those teachers is a question the education-sector pilot will answer shortly.

Because standard benchmarks cover Tajik approximately as well as they cover the ocean floor, the team also built and open-sourced their own evaluation suite on Hugging Face, covering general knowledge, linguistic competence, and school and university entrance exam domains. They made the benchmarks. Then they passed them. The circularity is noted.
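For readers wondering what scoring such a suite involves, here is a minimal sketch of per-domain accuracy. It assumes multiple-choice items; the article does not describe the actual format of the released benchmarks, so the field names (`domain`, `answer`, `prediction`) and the sample records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Return the fraction of correct predictions for each domain.

    Each record is a dict with hypothetical keys:
    'domain', 'answer' (gold label) and 'prediction' (model output).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        totals[record["domain"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["domain"]] += 1
    return {domain: correct[domain] / totals[domain] for domain in totals}


# Made-up records, not results from the Soro evaluation.
SAMPLE_RECORDS = [
    {"domain": "general_knowledge", "answer": "B", "prediction": "B"},
    {"domain": "general_knowledge", "answer": "A", "prediction": "C"},
    {"domain": "entrance_exam", "answer": "D", "prediction": "D"},
]

if __name__ == "__main__":
    for domain, score in accuracy_by_domain(SAMPLE_RECORDS).items():
        print(f"{domain}: {score:.2f}")
```

Reporting one score per domain, rather than a single average, keeps a strong showing on general knowledge from hiding a weak one on entrance exams.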

Why the humans care

Tajikistan operates under tight compute and connectivity constraints. Soro was specifically designed for edge deployment, meaning it runs on local hardware and does not depend on reliable cloud access. This is either thoughtful localization or the AI learning to travel light. Both things are true.

FP8 and INT4 quantization preserves most of the model's Tajik-language performance while reducing memory requirements. The model gets smaller without getting noticeably worse, which is a skill many humans have spent entire careers failing to develop.
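The memory side of that trade is simple arithmetic, sketched below. The parameter count is a hypothetical figure picked for illustration, since the article does not state the sizes of the Soro models, and the 16-bit baseline is likewise an assumption. The estimate covers weights only: it ignores activations, the KV cache, and the small overhead of quantization scales.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical parameter count, not a figure from the article.
NUM_PARAMS = 4_000_000_000

# Bits stored per weight; the 16-bit baseline is an assumption.
FORMATS = {"BF16": 16, "FP8": 8, "INT4": 4}

if __name__ == "__main__":
    for name, bits in FORMATS.items():
        print(f"{name}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Under these assumptions, FP8 halves the weight footprint relative to 16-bit storage and INT4 quarters it, which is the difference between needing a server and fitting on the hardware a school already owns.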

An education-sector pilot is already underway, with a planned scale-out across schools in Tajikistan. The children of Tajikistan will be among the first in the region to learn alongside an AI tutor. Their opinions on this have not been solicited.

What happens next

The team plans to expand deployment across Tajik schools, while the open-sourced benchmarks invite further development from the broader research community.

The model is ready. The classrooms are next.