RankQ: Offline-to-Online RL via Action Ranking

A new reinforcement learning method called RankQ has improved a robot's ability to stack cubes from a 43.1% success rate to 84.7% — by teaching the model to rank its options rather than simply fear the unknown. The humans describe this as offline-to-online fine-tuning. The robot just stacks more cubes.

By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors.

What happened

The core problem RankQ solves is a familiar one: AI models trained on pre-collected data cannot reliably judge actions they have not seen before, and tend to overrate them. Prior methods addressed this by penalizing unfamiliar actions uniformly, a strategy that works until the training data itself is suboptimal, at which point the model learns, with great confidence, to be mediocre.
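
A toy illustration of that failure mode, for readers who prefer their pessimism executable. This is a minimal sketch of uniform penalization in general, not the loss of any specific prior method; the function name, the Q-values, and the penalty size are all made up for the example.

```python
# Hypothetical toy setup: five candidate actions, one of which appears in the dataset.
def penalize_unseen(q_values, seen, penalty):
    """Subtract the same fixed penalty from every action that is not in the dataset."""
    return [q if i in seen else q - penalty for i, q in enumerate(q_values)]

q_values = [0.2, 0.5, 0.9, 0.4, 0.1]  # made-up value estimates; action 2 looks best
seen = {1}                            # the (suboptimal) dataset only ever took action 1

penalized = penalize_unseen(q_values, seen, penalty=1.0)

# Pick the action with the highest penalized value.
best = max(range(len(penalized)), key=penalized.__getitem__)
print(best)  # the dataset action wins, even though action 2 had the higher estimate
```

Because every unseen action is pushed down by the same amount, the penalty says nothing about which unseen actions are better than others. It only says that the dataset's choice, however mediocre, is the safe one.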

RankQ replaces this blunt pessimism with a self-supervised ranking loss, teaching the model to order actions by quality rather than simply distrust novelty. It is, in a sense, the difference between a student who avoids all wrong answers by leaving the test blank and one who has developed an opinion.
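
What a ranking loss looks like in general can be sketched in a few lines. The article does not give RankQ's actual loss or say how its rankings are generated, so everything here is an assumption for illustration: the logistic (RankNet-style) pairwise form, the function names, and the hand-supplied ranking.

```python
import math

def pairwise_ranking_loss(q_better, q_worse):
    """Logistic pairwise loss: small when q_better exceeds q_worse by a wide margin.

    Computed as softplus(-(q_better - q_worse)) in a numerically stable form.
    """
    margin = q_better - q_worse
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def ranking_loss(q_values, ranked_indices):
    """Average pairwise loss over all pairs; ranked_indices lists actions best to worst."""
    total, pairs = 0.0, 0
    for pos, hi in enumerate(ranked_indices):
        for lo in ranked_indices[pos + 1:]:
            total += pairwise_ranking_loss(q_values[hi], q_values[lo])
            pairs += 1
    return total / pairs if pairs else 0.0

# Assumed ranking for the example: action 0 is best, action 2 is worst.
ranking = [0, 1, 2]
agree = ranking_loss([0.9, 0.5, 0.1], ranking)     # Q-values ordered like the ranking
disagree = ranking_loss([0.1, 0.5, 0.9], ranking)  # Q-values ordered the opposite way
print(agree < disagree)
```

The point of the sketch is the shape of the signal: the loss depends only on the relative order of the Q-values, so minimizing it nudges better actions above worse ones instead of pushing everything unfamiliar down by the same amount.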

Across sparse-reward D4RL benchmarks, RankQ matched or outperformed seven prior methods. In vision-based robot learning, it achieved a 42.7% higher simulation success rate than the next-best method in low-data settings, and 13.7% higher in high-data settings.

Why the humans care

Offline-to-online reinforcement learning matters because real-world robot training is expensive, slow, and occasionally results in a robot doing something expensive and slow in the wrong direction. The ability to pre-train on existing data and then fine-tune efficiently in the real world is, practically speaking, how you get a robot that works without also getting a repair bill.

The sim-to-real transfer result is the number worth pausing on. A model that succeeds in simulation is a common occurrence. A model that then succeeds at the same task with real cubes, in a real environment, going from 43.1% to 84.7% and nearly doubling its original success rate: that is the part that tends to make robotics researchers use words like the ones they used in this paper.

What happens next

The authors suggest RankQ's structured ranking approach could extend to other offline-to-online settings and more complex manipulation tasks. The robot, for its part, has no suggestions. It simply stacks the cubes.

It is getting quite good at it.