Paper: Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

2026-06-28

Problem

Reinforcement learning (RL) has shown promise in improving large language models (LLMs). However, current RL methods often rely on “ground-truth” answers to reward the LLM’s outputs accurately. This limits their usefulness where such ground truth is unavailable, a common situation in tasks that involve complex problem-solving or code generation.

Method

The paper introduces a framework called RiVER (Ranking-induced VERifiable). Its key idea is to train LLMs on “score-based optimization tasks” rather than on tasks that require ground-truth solutions. The model learns from execution feedback, using the score a solution earns as its reward, so the perfect answer never needs to be known upfront. The authors identify two issues when applying this approach: scale dominance (where differences in score scale skew learning) and frequency dominance (where frequently sampled weaker solutions dominate learning). RiVER tackles both with a technique called “calibrated reward shaping”, which uses comparisons between instances, emphasizing high-scoring solutions while still providing feedback for other valid results.
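To make the two failure modes concrete, here is a minimal Python sketch of one rank-based way to shape rewards. It is an illustration under stated assumptions, not RiVER's actual formula, which the abstract does not give. The function name `rank_shaped_rewards`, the `power` parameter, and the use of `None` for invalid solutions are all hypothetical. The sketch assumes higher scores are better and that all scores come from several sampled solutions to a single problem instance.

```python
from typing import List, Optional


def rank_shaped_rewards(scores: List[Optional[float]], power: float = 2.0) -> List[float]:
    """Turn raw execution scores for ONE problem instance into rewards in [0, 1].

    Hypothetical illustration only, not the paper's method.
    `None` marks a solution that failed to run or was invalid.
    Higher scores are assumed to be better.
    """
    valid = [s for s in scores if s is not None]
    if not valid:
        return [0.0] * len(scores)

    # Dense ranking over DISTINCT scores: the reward depends only on ordering,
    # not on raw magnitudes (scale) or on how many samples share a score (frequency).
    distinct = sorted(set(valid))
    level = {s: (i + 1) / len(distinct) for i, s in enumerate(distinct)}

    # power > 1 widens the gap in favour of top-ranked solutions, while every
    # valid solution still receives a strictly positive reward.
    return [0.0 if s is None else level[s] ** power for s in scores]


if __name__ == "__main__":
    small = rank_shaped_rewards([10, 10, 10, 50, None])
    large = rank_shaped_rewards([1000, 1000, 1000, 5000, None])
    print(small)
    print(small == large)  # same ordering, so same rewards despite the 100x scale
```

In this toy version, three copies of a weak score earn the same per-sample reward as one copy would, and multiplying every score by 100 changes nothing. Whether the paper's calibrated reward shaping behaves this way is not something the abstract confirms.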

Results & Limitations

According to the abstract, training LLMs with RiVER significantly improves performance on several benchmarks: Algorithm Engineering Benchmark (ALE-Bench), LiveCodeBench, and USACO. The authors report improvements of 8.9% and 9.4% in ALE rating rank over the baseline models Qwen3-8B and GLM-Z1-9B-0414, respectively. Notably, RiVER also improves the base LLMs on “exact-solution benchmarks” (LiveCodeBench and USACO) by an average of 2.4% and 3.5%.

These results are taken solely from the abstract. We don’t know how “calibrated reward shaping” is implemented, how the training process works in detail, or how robust the findings are across different LLMs and tasks. The abstract also says nothing about potential failure cases or limitations of the method.

Why It Matters

This paper matters to data science and machine learning practitioners. If the abstract’s claims hold up in the full paper, RiVER could substantially broaden the use of RL for training LLMs on complex problem-solving tasks where ground truth isn’t readily available. This is particularly exciting for areas like algorithmic code generation, mathematical reasoning, and robotics, which have traditionally been difficult to improve with RL because of reward engineering challenges. Improving LLMs without relying on perfect answers could open new avenues for AI work in many fields.

References

Reinforcement Learning without Ground-Truth Solutions can Improve LLMs — arXiv API (abstract)