ERQA+: An Enhanced Benchmark on Embodied Reasoning
TL;DR
We introduce ERQA+, a new embodied reasoning benchmark that complements ERQA (from the Gemini Robotics team) in several aspects:
- Egocentric scenes: ERQA+ focuses on egocentric scenes, matching the first-person perspective of embodied agents.
- Reduced contamination: ERQA+ is newly created via manual annotation of recent robotic videos rather than reusing earlier VQA datasets.
- Extended taxonomy: ERQA+ offers a finer-grained taxonomy that covers additional embodied reasoning skills.
- Question types: ERQA+ spans multiple question types and increases the number of options in multiple-choice questions.
- Difficulty: Multi-stage filtering excludes trivial or easy samples to keep evaluation challenging.
Why ERQA+
Rapid progress in large language and vision-language models has enabled robotic agents that reason about the world, yet evaluating nuanced embodied capabilities remains hard. Existing benchmarks emphasize text-centric tasks or reuse legacy data that does not capture egocentric perspectives, leading to data leakage and shallow reasoning shortcuts.
ERQA+ addresses these gaps with a newly collected benchmark dedicated to embodied reasoning. Built in the spirit of the original ERQA but significantly extended, it delivers harder questions, a finer-grained taxonomy, and contamination-resistant samples for stress-testing modern VLMs.
Experimental Results
We evaluate a range of proprietary and open-source VLMs. Even state-of-the-art models show large accuracy gaps, especially on planning and spatial reasoning, underscoring the remaining challenges in embodied understanding.
| Model | Perception | Planning | Prediction | Spatial Reasoning | ERQA+ (All) | ERQA (All) |
|---|---|---|---|---|---|---|
| Gemini-3-Pro-preview | 53.5 | 49.0 | 74.0 | 59.1 | 57.3 | 66.0 |
| Gemini-2.5-Pro | 33.6 | 37.0 | 54.0 | 33.3 | 35.1 | 57.3 |
| Gemini-2.5-Flash | 32.3 | 22.0 | 48.0 | 26.6 | 28.9 | 53.3 |
| GPT-5 | 42.4 | 38.0 | 56.0 | 46.7 | 45.0 | 59.3 |
| GPT-5-Mini | 34.1 | 22.0 | 40.0 | 39.3 | 35.8 | 53.8 |
| Qwen3-VL-235b-a22b | 33.2 | 34.0 | 48.0 | 32.8 | 34.0 | 49.5 |
| GPT-5-Nano | 18.9 | 12.0 | 18.0 | 21.2 | 19.3 | 45.8 |
| Qwen3-VL-8B | 23.5 | 14.0 | 32.0 | 21.0 | 21.5 | 41.8 |
| Gemma3-12B | 15.7 | 8.0 | 14.0 | 13.9 | 13.6 | 36.8 |
| Phi-4-Multimodal | 18.4 | 11.0 | 14.0 | 14.3 | 15.0 | 36.0 |
Higher numbers indicate better accuracy (%). ERQA+ exposes larger gaps than the original ERQA.
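The per-category and overall accuracies above can be reproduced with straightforward multiple-choice scoring. Below is a minimal sketch; the record schema (`category`, `answer`, `prediction` fields) is an assumption for illustration, not the actual ERQA+ data format.

```python
from collections import defaultdict

def score(records):
    """Compute per-category and overall accuracy (%) for multiple-choice records.

    Each record is a dict with a question 'category', the gold 'answer'
    option letter, and the model's 'prediction' letter. These field names
    are illustrative; the real ERQA+ schema may differ.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        # Case-insensitive match on the chosen option letter.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["category"]] += 1
    per_category = {c: 100.0 * correct[c] / total[c] for c in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_category, overall

# Toy example (not real ERQA+ data):
records = [
    {"category": "Perception", "answer": "A", "prediction": "a"},
    {"category": "Perception", "answer": "B", "prediction": "C"},
    {"category": "Planning",   "answer": "D", "prediction": "D"},
]
per_category, overall = score(records)
```

Note that the "All" column is a sample-weighted average over all questions, not a mean of the per-category scores.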
BibTeX
@misc{baai2025erqaplus,
  title={ERQA+: An Enhanced Benchmark on Embodied Reasoning},
  author={BAAI FlagEval Team},
  year={2025},
  howpublished={\url{https://flageval-baai.github.io/ERQA-Plus-page/}},
  note={Preprint}
}