ReasonBENCH Quantifies LLM Reasoning Instability with Multi-Run Evaluation and Standardized Frameworks

Large language models are increasingly relied upon for complex reasoning tasks, such as multi-step problem solving, yet current methods for evaluating their performance often overlook a crucial factor: consistency. Nearchos Potamitis, Lars Klein, and Akhil Arora, from Aarhus University and EPFL, address this gap with ReasonBENCH, a new benchmark designed to quantify the stability of LLM reasoning. This work introduces a standardized evaluation library and a rigorous multi-run protocol, allowing researchers to assess not only the quality of an answer but also the reliability and cost-effectiveness of achieving it. Across a range of tasks, the team demonstrates that many reasoning strategies exhibit surprisingly high instability, with even top-performing methods often incurring unpredictable costs. The research establishes reproducibility as a critical factor for trustworthy performance in large language models.

Current evaluations typically report single-run accuracy, overlooking the issue of consistency; even powerful LLMs can produce different answers to the same question. ReasonBENCH addresses this gap by quantifying how reliably LLMs arrive at the same answer when presented with the same problem repeatedly. The benchmark incorporates a diverse set of reasoning tasks, including commonsense reasoning, symbolic manipulation, mathematical problem solving, and scientific analysis, and assesses performance across multiple runs with slight variations in prompting. The findings reveal that LLMs can be surprisingly unreliable, exhibiting significant variability in their reasoning even with the same input. Importantly, increasing model size does not guarantee stability: larger models are not necessarily more consistent. LLMs are also sensitive to even minor changes in prompting, which can lead to differing answers. The research emphasizes the need for LLMs to express their uncertainty, allowing users to better assess the reliability of their outputs. Ultimately, this work advocates for a shift in evaluation metrics, moving beyond simple accuracy to include measures of stability and reliability; ReasonBENCH serves as a tool to facilitate this change and drive research towards more trustworthy AI systems.

ReasonBENCH, A Reproducibility Benchmark for LLMs

Researchers developed ReasonBENCH, a novel benchmark designed to rigorously quantify instability in LLM reasoning. Recognizing that current evaluations primarily report single-run accuracy, they built a system to address the critical gap in assessing reproducibility and consistency. ReasonBENCH achieves this through a modular evaluation library that standardizes reasoning frameworks, models, and tasks, enabling fair comparisons. Crucially, the work introduces a multi-run protocol that systematically executes each method multiple times and reports statistically reliable metrics for both solution quality and computational cost. The protocol quantifies the variance in performance across runs, providing a more nuanced picture of model reliability. Analyzing a diverse set of reasoning strategies and models across various domains, the team found that the vast majority exhibit high instability. Notably, even strategies achieving similar average performance can display confidence intervals up to four times wider, highlighting significant differences in consistency.
Further investigation revealed that top-performing methods often incur higher and less stable costs, demonstrating a trade-off between solution quality and resource utilization. By systematically measuring and reporting variance, ReasonBENCH establishes reproducibility as a critical dimension for reliable LLM reasoning, laying a foundation for future advancements in reasoning methods and uncertainty quantification techniques.
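To make the multi-run idea concrete, the following is a minimal sketch of such a protocol, not the ReasonBENCH library itself: names like `evaluate_once` and the seeded simulation are hypothetical placeholders, and a real harness would prompt an LLM, score its answers, and meter its token usage.

```python
import math
import random
import statistics

def evaluate_once(method: str, task: str, seed: int) -> tuple[float, float]:
    """Stand-in for one full run of a reasoning method on a task.

    A real harness would call the LLM and score its output; a seeded
    simulation keeps this sketch self-contained and runnable.
    """
    rng = random.Random(f"{method}/{task}/{seed}")
    return rng.uniform(0.5, 0.9), rng.uniform(800, 1500)  # (solve rate, tokens)

def multi_run_report(method: str, task: str, n_runs: int = 10) -> dict[str, str]:
    """Execute n_runs independent trials and report mean ± 95% CI half-width
    for both solution quality and cost, instead of a single-run number."""
    runs = [evaluate_once(method, task, seed) for seed in range(n_runs)]
    accs, costs = zip(*runs)

    def mean_ci(xs) -> str:
        m = statistics.mean(xs)
        # Normal-approximation interval; a t-interval is more accurate for
        # only ten runs, but this keeps the sketch short.
        half = 1.96 * statistics.stdev(xs) / math.sqrt(len(xs))
        return f"{m:.3f} ± {half:.3f}"

    return {"solve_rate": mean_ci(accs), "cost_tokens": mean_ci(costs)}

print(multi_run_report("tree_of_thoughts", "math_word_problems"))
```

Reporting the interval alongside the mean, for cost as well as for accuracy, is what distinguishes this kind of protocol from the usual single-run accuracy figure.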
Reasoning Instability Quantified Across Language Models

This work presents ReasonBENCH, a new benchmark designed to rigorously quantify the instability inherent in LLM reasoning. Recognizing that current evaluation practices overwhelmingly report single-run accuracy, the team conducted an in-depth evaluation by running ten independent trials for each model, algorithm, and task combination. This approach allows the reporting of not only mean performance but also variance and confidence intervals, providing a more statistically reliable assessment of LLM reasoning capabilities. Experiments reveal significant instability across diverse tasks and models.
Results demonstrate that even reasoning strategies with similar average performance can exhibit confidence intervals up to four times wider, highlighting substantial variation in their reliability. Notably, top-performing methods frequently incur higher and less stable costs, indicating a trade-off between accuracy and cost stability.
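As an illustration of why interval width matters, the toy numbers below compare two hypothetical strategies with nearly identical mean solve rates whose 95% confidence intervals differ several-fold in width; the values are invented for illustration and are not results from the paper.

```python
import math
import statistics

def ci_half_width(runs: list[float]) -> float:
    """Normal-approximation 95% confidence-interval half-width of the mean."""
    return 1.96 * statistics.stdev(runs) / math.sqrt(len(runs))

stable   = [0.71, 0.72, 0.70, 0.73, 0.71, 0.72, 0.70, 0.71, 0.72, 0.71]
unstable = [0.60, 0.85, 0.66, 0.79, 0.58, 0.83, 0.65, 0.80, 0.62, 0.81]

for name, runs in [("stable", stable), ("unstable", unstable)]:
    print(f"{name:>8}: mean = {statistics.mean(runs):.3f}, "
          f"95% CI half-width = ±{ci_half_width(runs):.3f}")
# Both means land near 0.71, but the second interval is several times wider,
# so a single run of the unstable strategy says far less about its true quality.
```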
The team observed that this instability compromises reproducibility across runs, directly impacting the reliability of reported performance metrics. To facilitate reproducible research, the team released an agentic AI library integrating ten state-of-the-art reasoning algorithms with a framework for cost-efficient inference. This combination establishes reproducible baselines and makes it possible to uncover instability in LLM reasoning strategies. The data confirms that assessing reproducibility is a critical dimension for reliable LLM reasoning, providing a foundation for future methods and uncertainty quantification techniques.

LLM Reasoning Instability Revealed by ReasonBENCH

This work introduces ReasonBENCH, a new benchmark designed to rigorously evaluate the stability of LLMs when performing reasoning tasks. Researchers developed a comprehensive AI library encompassing eleven distinct reasoning methods, tested across four different models and seven diverse tasks. A key innovation lies in the multi-run evaluation protocol, in which each model-algorithm-task combination undergoes ten independent trials, yielding statistically reliable estimates of both accuracy and cost, complete with confidence intervals. The results of this systematic evaluation reveal a significant degree of instability in current LLM reasoning strategies. Even methods exhibiting similar average performance can display substantial variation, with confidence intervals differing by a factor of four. Furthermore, the highest-performing methods often incur greater and less predictable costs. This instability compromises the reproducibility of results and raises concerns about the reliability of reported performance metrics.
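The evaluation grid of models, reasoning algorithms, and tasks, with ten trials per combination, can be pictured with the hypothetical harness loop below; the identifiers are placeholders rather than the released library's actual API, and only a subset of methods and tasks is listed.

```python
from itertools import product

# Placeholder identifiers; not the released library's actual model, method,
# or task names, and only a subset of each is shown.
MODELS     = ["model_a", "model_b", "model_c", "model_d"]
ALGORITHMS = ["cot", "self_consistency", "tree_of_thoughts"]
TASKS      = ["commonsense", "symbolic", "math", "science"]
N_RUNS     = 10  # independent trials per (model, algorithm, task) cell

def run_trial(model: str, algorithm: str, task: str, seed: int) -> dict:
    """Placeholder for one trial: apply `algorithm` with `model` on `task`
    and record whether the answer was correct and what it cost."""
    return {"model": model, "algorithm": algorithm, "task": task,
            "seed": seed, "correct": None, "cost_tokens": None}

results = [
    run_trial(model, algorithm, task, seed)
    for model, algorithm, task in product(MODELS, ALGORITHMS, TASKS)
    for seed in range(N_RUNS)
]
print(len(results), "trials scheduled")  # 4 models * 3 methods * 4 tasks * 10 runs = 480
```

Aggregating the resulting trials per (model, algorithm, task) cell into means, variances, and confidence intervals yields the kind of stability report described above.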
The team also investigated the impact of model scale and identified sources of variance, providing insights into the trade-offs between solve rate and stability. The authors acknowledge that the benchmark focuses on a specific set of reasoning tasks and models, and future work could broaden this scope. Nevertheless, ReasonBENCH establishes a foundation for more robust evaluation of LLM reasoning, highlighting reproducibility as a critical dimension for trustworthy performance. The publicly available library and evaluation framework empower researchers to develop and assess new reasoning methods with statistically sound metrics, ultimately advancing the field of artificial intelligence.

👉 More information
🗞 ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
🧠 ArXiv: https://arxiv.org/abs/2512.07795
