Researchers have built ReSS, a framework that forces LLMs to reason about tabular data the way a decision tree would — by feeding the model the tree's own decision paths as a structured scaffold before generating any natural-language explanation. The result: predictions grounded in verifiable logic, with measurable reductions in hallucination on medical and financial datasets.
What's new
ReSS works in three steps. First, a decision-tree model processes a tabular input and extracts its instance-level decision path — the exact chain of feature comparisons that produced a given prediction. That path, along with the raw features and label, gets handed to an LLM, which generates a natural-language rationale constrained to follow the decision logic. The resulting synthetic dataset is then used to fine-tune a pretrained LLM into a domain-specific tabular reasoning model. A scaffold-invariant data augmentation step rounds things out to improve generalization. To actually measure faithfulness — not just accuracy — the team introduces three metrics: hallucination rate, explanation necessity, and explanation sufficiency.
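The first step — walking a fitted tree and rendering an instance's decision path as text for the LLM — can be sketched with scikit-learn. This is a minimal illustration, not the authors' code: the dataset, the `decision_path_text` helper, and the prompt wording are all assumptions.

```python
# Sketch of step one: extract an instance-level decision path from a fitted
# decision tree and render it as a textual scaffold. Illustrative only; the
# dataset, helper name, and prompt format are assumptions, not ReSS itself.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

def decision_path_text(tree, x, feature_names):
    """Render the exact chain of feature comparisons for one instance."""
    t = tree.tree_
    node, steps = 0, []
    while t.children_left[node] != -1:  # walk until we reach a leaf
        feat, thr = t.feature[node], t.threshold[node]
        if x[feat] <= thr:
            steps.append(f"{feature_names[feat]} = {x[feat]:.2f} <= {thr:.2f}")
            node = t.children_left[node]
        else:
            steps.append(f"{feature_names[feat]} = {x[feat]:.2f} > {thr:.2f}")
            node = t.children_right[node]
    pred = tree.classes_[t.value[node].argmax()]  # majority class at the leaf
    return " AND ".join(steps), pred

scaffold, pred = decision_path_text(tree, X[0], data.feature_names)
prompt = (f"The tree followed this path: {scaffold}. "
          f"It predicts class {pred}. "
          "Explain the prediction using only these comparisons.")
```

A prompt built this way constrains the rationale-generating LLM to the tree's actual logic; pairing many such prompts with generated rationales yields the synthetic fine-tuning set the article describes.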
Why it matters
Tabular data dominates high-stakes domains like healthcare and finance, and the standard tradeoff has always been interpretability versus performance. Symbolic models like decision trees are auditable but expressively limited. LLMs are expressive but notoriously prone to fabricating reasoning that sounds plausible but doesn't reflect what the model actually computed. ReSS attacks that gap directly. On medical and financial benchmarks, ReSS-trained models outperform both vanilla decision trees and standard fine-tuning approaches by up to 10%, while producing explanations the authors can quantitatively verify as faithful to the underlying logic.
What to watch
The scaffold-invariant augmentation strategy is worth tracking — if it holds up across more diverse tabular domains, it suggests a broader recipe for grounding LLM reasoning in structured model outputs beyond decision trees. The introduced faithfulness metrics are also notable: hallucination rate, necessity, and sufficiency are concrete, reusable tools the field has largely lacked for evaluating explanation quality in structured-data settings.
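One plausible way to operationalize the first of those metrics — purely illustrative, since the paper's exact definitions aren't given here — is to parse an explanation into feature-comparison claims and count the fraction that don't appear in the tree's actual decision path:

```python
# Toy hallucination-rate check. NOT the paper's metric: the claim
# representation (feature, operator, threshold) and the matching rule
# are assumptions made for illustration.
def hallucination_rate(claims, path):
    """Fraction of explanation claims unsupported by the decision path.

    claims, path: lists of (feature, op, threshold) tuples.
    """
    if not claims:
        return 0.0
    path_set = set(path)
    unsupported = sum(1 for c in claims if c not in path_set)
    return unsupported / len(claims)

path = [("age", ">", 50), ("bmi", "<=", 30)]        # ground-truth tree path
claims = [("age", ">", 50), ("glucose", ">", 120)]  # second claim fabricated
rate = hallucination_rate(claims, path)             # → 0.5
```

Necessity and sufficiency would need counterfactual checks on top of this (does removing a claim break the prediction's justification; do the claims alone entail it), which is what makes the trio a more demanding standard than accuracy alone.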