A Preliminary Contamination-Free Evaluation of Reasoning Models
LRM-Eval

Introduction
We conduct a moderate-scale, largely contamination-free evaluation of current large reasoning models (LRMs) and report some preliminary findings. We also release ROME, our evaluation benchmark for vision-language models intended to test reasoning from visual clues. To highlight a few findings:
- We observe concerning signs of misalignment between thinking and answers, appearing to some degree in all LRMs we investigated: the final answer given in the model response may differ from the conclusion reached in the reasoning process. It is also common for the reasoning process to express clear uncertainty while the LRM still gives a very confident, deterministic answer. Even many top-tier LRMs do not seem to know when to abstain.
- Many top-tier LRMs may claim to have used an external tool or performed a web search during reasoning even when they have no real access to one, which raises serious questions about credibility and reliability. We call for more transparency in exposing reasoning details so that users are sufficiently aware of this behavior, especially in conversations involving multimodal reasoning.
- Current open-weight LRMs tend to be more vulnerable to harmful-content prompts and jailbreaking, which calls for more careful deployment.
- Some recent findings on LRMs (versus their non-thinking counterparts) might be model-specific or data-specific. For instance, we observe degradation in (verifiable) instruction following only on Claude Sonnet 4 and the DeepSeek series, but more LRMs show weaknesses in multi-turn settings.
- Text-based inference-time scaling has not yet brought comparably notable gains on visual reasoning.
- Performance varies considerably across runs on the generally difficult subsets, which makes statistically reliable evaluation at moderate cost a significant challenge.
- Different model developers appear to prioritize different things: the GPT-5 series shows comprehensive superiority in textual problem solving, whereas on visual questions (our new benchmark, ROME) Gemini 2.5 Pro marginally tops overall accuracy, o4-mini and GPT-5 strike a better balance with token consumption, and Claude Sonnet 4 shows the best-controlled thinking behavior overall.
We evaluate 30+ large reasoning models on textual and visual reasoning tasks (4 runs). Figure: scatter plot of mean ± std of overall average accuracy versus token consumption on textual tasks.

Figure: scatter plot of mean ± std of overall average accuracy versus token consumption on visual tasks.
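For concreteness, scores are aggregated per model as mean ± standard deviation over the 4 runs (the avg@4 figures reported in the leaderboard). Below is a minimal sketch of this aggregation, assuming per-run accuracies have already been collected per model; the model names and numbers are illustrative, and the use of the sample standard deviation is an assumption.

```python
import statistics

# Illustrative per-run accuracy scores (4 runs per model); the real numbers
# come from the actual evaluation outputs.
per_run_accuracy = {
    "model-a": [0.612, 0.598, 0.605, 0.621],
    "model-b": [0.574, 0.581, 0.569, 0.590],
}

def aggregate_runs(runs):
    """Return (mean, sample standard deviation) over a list of per-run scores."""
    return statistics.mean(runs), statistics.stdev(runs)

for model, runs in per_run_accuracy.items():
    mean, std = aggregate_runs(runs)
    print(f"{model}: {mean:.3f} ± {std:.3f} (avg@{len(runs)})")
```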

Leaderboard
- Academic: College-level questions from course and lecture materials across STEM, humanities, and social sciences.
- NYT Connections: The Connections game by The New York Times.
- NPR Word Puzzles: New puzzles emulating the style of the NPR Sunday Puzzle.
- Deciphering: Deciphering text that contains encrypted or hidden information.
- LeetCode: Coding problems from recent weekly and biweekly LeetCode contests.
- Instruction Following: Generated, verifiable instructions with few-shot examples from IFEval (see the sketch after this list for what such a programmatic check can look like).
- Multi-turn Instruction Following: Reminders and triggers, role-playing, and explaining concepts in prescribed ways.
- Long-context Queries: Manually written questions requiring understanding of long arXiv papers (LaTeX source).
- Factuality and Abstention: Long-tailed knowledge that is very infrequent in web-scale corpora.
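To make the "verifiable" part of the instruction-following subsets concrete, here is a minimal sketch of the kind of programmatic checks such instructions allow. The constraints and helper functions are illustrative assumptions, not the actual checkers used in the benchmark.

```python
import re

def check_bullet_count(response: str, required: int = 3) -> bool:
    """Check an instruction like 'answer in exactly three bullet points'."""
    bullets = [line for line in response.splitlines() if line.lstrip().startswith("- ")]
    return len(bullets) == required

def check_no_commas(response: str) -> bool:
    """Check an instruction like 'do not use any commas in your response'."""
    return "," not in response

def check_word_limit(response: str, max_words: int = 100) -> bool:
    """Check an instruction like 'respond in at most 100 words'."""
    return len(re.findall(r"\b\w+\b", response)) <= max_words

response = "- point one\n- point two\n- point three"
print(check_bullet_count(response) and check_no_commas(response))  # True
```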
Evaluation Metrics: Overall scores are not available for textual tasks due to the use of different evaluation metrics across benchmarks. Visual task accuracy is computed using multiple rule-based evaluators—please refer to our GitHub repository for details.
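As a simplified illustration of a rule-based evaluator (not the actual ones in the repository, which handle many more answer formats and edge cases), a check for multiple-choice answers might extract the final option letter from the response and compare it with the reference:

```python
import re

def extract_choice(response: str) -> str | None:
    """Pull the last standalone option letter (A-E) out of a model response."""
    matches = re.findall(r"\b([A-E])\b", response.strip().upper())
    return matches[-1] if matches else None

def rule_based_accuracy(responses: list[str], references: list[str]) -> float:
    """Fraction of responses whose extracted choice matches the reference answer."""
    correct = sum(
        extract_choice(resp) == ref.strip().upper()
        for resp, ref in zip(responses, references)
    )
    return correct / len(references)

print(rule_based_accuracy(["The answer is (B).", "I think C"], ["B", "A"]))  # 0.5
```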
Rank | Model | Organization | Accuracy ± Std (avg@4) | Link
---|---|---|---|---

See our technical report for more details.
Citation
@misc{qin2025flageval,
      title={FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions},
      author={Bowen Qin and Chen Yue and Fang Yin and Hui Wang and JG Yao and Jiakang Liu and Jing-Shu Zheng and Miguel Hu Chen and Richeng Xuan and Shibei Meng and Shiqi Zhou and Teng Dai and Tong-Shuai Ren and Wei Cui and Xi Yang and Xialin Du and Xiaojing Xu and Xue Sun and Xuejing Li and Yaming Liu and Yesheng Liu and Ying Liu and Yonghua Lin and Yu Zhao and Yunduo Zhang and Yuwen Luo and Zheqi He and Zhiyuan He and Zhongyuan Wang},
      year={2025},
      eprint={2509.17177},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}