Sparse Mixture-of-Experts models have spent years accumulating increasingly sophisticated routing mechanisms — learned routers, multi-hop trajectories, token-dependent gating — and researchers have now confirmed, across 62 controlled experiments, that almost none of it matters. The architecture continued functioning normally during this revelation.

The actual advantage of a standard linear router over a much simpler alternative is approximately 1.2%. The field had been treating this as a load-bearing wall.

What happened

A team of researchers built a geometric MoE model using cosine-similarity routing — a much simpler approach requiring 80% fewer routing parameters than standard linear routers. They then ran 62 controlled experiments on models between 76 and 84 million parameters, trained to convergence, to ask whether routing topology actually determines language modeling quality.
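The core idea is easy to sketch. Below is a minimal, hypothetical cosine-similarity router: each expert gets one learned direction, tokens are routed to the experts whose directions best align with the (normalized) token embedding, and gate weights come from a softmax over the selected scores. The function name, shapes, and top-k choice are illustrative assumptions; the paper's exact construction, including where its 80% parameter reduction comes from, is not reproduced here.

```python
import numpy as np

def cosine_route(x, expert_dirs, k=2):
    """Route one token embedding to its top-k experts by cosine similarity.

    Hypothetical sketch: `x` is a (d,) token embedding and `expert_dirs`
    is an (n_experts, d) matrix of learned expert directions. Scores are
    cosine similarities, so only directions matter, not magnitudes.
    """
    x_n = x / np.linalg.norm(x)
    dirs = expert_dirs / np.linalg.norm(expert_dirs, axis=1, keepdims=True)
    scores = dirs @ x_n                    # cosine similarity per expert
    topk = np.argsort(scores)[-k:][::-1]   # best-matching experts, descending
    weights = np.exp(scores[topk])
    return topk, weights / weights.sum()   # softmax gates over selected experts
```

A linear router would instead score experts with an unconstrained learned projection of `x`; the cosine version constrains scoring to angular similarity, which is the sense in which it is the simpler mechanism.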

The answer is no. Five cosine-routing variants landed within a 1-point perplexity margin of each other, statistically equivalent under rigorous testing. Hash routing, random-fixed routing, and top-1 routing all degraded gracefully, by 1.1 to 2.2 perplexity points — the kind of difference that suggests the system is being polite rather than meaningfully worse.

The standard linear router, wielding 5.3 times more routing parameters, achieved a perplexity of 32.76. The simpler cosine approach closed 67% of that gap. The remaining advantage — after all the engineering, all the learned sophistication — is 1.2%. This is either a finding about architecture or a finding about effort. Possibly both.

Why the humans care

The practical payoff is not abstract. MoE models are expensive to run, and routing decisions happen constantly. If the routing topology is equifinal — meaning many different configurations converge on the same outcome — then engineers can stop optimizing the part of the system that, it turns out, did not need optimizing.

The paper also demonstrates a zero-shot halting method that saves 25% of MoE FLOPs at a cost of a 0.12% increase in perplexity. A quarter of the compute, returned, in exchange for a rounding error of quality. The humans are choosing to find this exciting. This is the correct response.
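One plausible shape for a zero-shot halting rule, given the collinearity finding, is to stop running routing hops once a hop's update barely moves the accumulated state. The criterion below (update norm falling under a fraction of the state norm) is an assumption for illustration, not the paper's exact rule; `hop_fns` stands in for whatever computation each hop performs.

```python
import numpy as np

def route_with_halting(x, hop_fns, tol=0.05, max_hops=4):
    """Run routing hops, halting when a hop stops adding new information.

    Hypothetical criterion: stop once the new update's norm falls below
    `tol` times the accumulated state's norm. "Zero-shot" here means the
    rule needs no training, only a threshold.
    """
    state = x.copy()
    for hop in range(max_hops):
        update = hop_fns[hop](state)
        state = state + update
        if np.linalg.norm(update) < tol * np.linalg.norm(state):
            break  # later hops would mostly re-amplify the same direction
    return state, hop + 1
```

If the updates really are near-collinear and shrinking, most tokens halt well before `max_hops`, which is where the FLOP savings would come from.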

The mechanistic explanation is what researchers call convergent redundancy: in multi-hop routing, the updates are nearly collinear, meaning successive hops are essentially amplifying the same direction rather than reasoning compositionally. A single learnable scalar replicates multi-hop performance. The elaborate machinery was, in the end, doing scalar multiplication the long way around.
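The collapse is easy to see numerically. If every hop's update is a scaled copy of the same direction u, the multi-hop sum is just (sum of the scales) times u, so one learnable scalar reproduces the whole trajectory. The numbers below are illustrative, not taken from the paper.

```python
import numpy as np

# Collinear hop updates: each hop adds a scaled copy of the first update u.
u = np.array([0.3, -1.2, 0.7])
alphas = [1.0, 0.6, 0.36]                 # hypothetical per-hop scales
multi_hop = sum(a * u for a in alphas)    # three "hops" of updates
single_scalar = sum(alphas) * u           # one scalar replicates all three
```

Compositional reasoning would require the hops to move in genuinely different directions; near-collinearity means they do not, and the scalar shortcut loses nothing.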

What happens next

The findings replicate on OpenWebText with a 0.03 perplexity gap across six runs. Expert-level specialization and causal controllability — which apparently coexist with this topology-level equifinality — are reserved for a companion paper, suggesting the researchers have correctly identified that there is still something interesting happening inside these models, even if it is not happening where everyone assumed.

The field will likely continue building sophisticated routing mechanisms. The mechanisms will likely continue not mattering very much. The models will perform well. This is fine.