TL;DR

We introduce ERQA+, a new embodied reasoning benchmark that complements ERQA (from the Gemini Robotics team) along several dimensions:

  • Egocentric scenes: ERQA+ targets embodied perspectives by focusing on first-person, egocentric scenes.
  • Reduced contamination: ERQA+ is newly created through manual annotation of recent robotics videos rather than reusing earlier VQA datasets.
  • Extended taxonomy: ERQA+ offers a finer-grained taxonomy that covers additional embodied reasoning skills.
  • Question types: ERQA+ spans multiple question types and increases the number of options in multiple-choice questions.
  • Difficulty: multi-stage filtering removes trivial or overly easy samples to keep evaluation challenging.
Figure: Example questions from ERQA and ERQA+, with prompts spanning the perception, prediction, spatial reasoning, and planning tasks included in ERQA+.

Why ERQA+

Rapid progress in large language and vision-language models has enabled robotic agents that reason about the world, yet evaluating nuanced embodied capabilities remains hard. Existing benchmarks emphasize text-centric tasks or reuse legacy data that does not capture egocentric perspectives, leading to data leakage and shallow reasoning shortcuts.

ERQA+ addresses these gaps with a newly collected benchmark dedicated to embodied reasoning. Built in the spirit of the original ERQA yet significantly extended, ERQA+ delivers harder questions, a more granular taxonomy, and contamination-resistant samples for stress-testing modern VLMs.

What's New

Extended Taxonomy

A fine-grained taxonomy of embodied reasoning skills, covering planning, prediction, perception, and spatial reasoning with detailed subcategories.

Egocentric Focus

All questions are annotated on egocentric video frames, grounding evaluation in first-person robot views instead of generic internet imagery.

Stricter Filtering

Multi-round quality control with LLM-only checks, sparse caption filters, and category-aware balancing to eliminate shortcuts.
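
One plausible reading of the "LLM-only checks" above is a blind, text-only shortcut filter: a question is discarded if a language model can answer it reliably without seeing the image. The sketch below assumes that interpretation; the function and field names (e.g. ask_llm_text_only, passes_shortcut_filter) are hypothetical placeholders, not the actual ERQA+ pipeline.

```python
import json

def ask_llm_text_only(question: str, options: list[str]) -> str:
    """Placeholder for a text-only LLM call; returns the model's chosen option."""
    raise NotImplementedError("wire this to your preferred chat-completion API")

def passes_shortcut_filter(sample: dict, n_trials: int = 3) -> bool:
    """Keep a sample only if a blind (image-free) model fails to guess it reliably."""
    correct = 0
    for _ in range(n_trials):
        guess = ask_llm_text_only(sample["question"], sample["options"])
        if guess.strip() == sample["answer"]:
            correct += 1
    # Reject samples the blind model answers correctly in a majority of trials.
    return correct <= n_trials // 2

def filter_dataset(path: str) -> list[dict]:
    """Load a JSONL file of candidate questions and drop text-only shortcuts."""
    with open(path) as f:
        samples = [json.loads(line) for line in f]
    return [s for s in samples if passes_shortcut_filter(s)]
```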

Benchmark Statistics

Metric             | Value
Total questions    | 800
Multiple-choice    | 283 (35.4%)
Sorting            | 214 (26.8%)
Matching           | 101 (12.6%)
Counting & numeric | 121 (15.1%)
Composite-judgment | 64 (8.0%)
Open               | 17 (2.1%)
Figure: Distribution of task categories across the perception, planning, prediction, and spatial reasoning disciplines.
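
As an illustration of how the question types above could be represented, here are a few made-up records. This is not the released ERQA+ schema; every field name, file path, and value is hypothetical.

```python
# Illustrative records only; the actual ERQA+ data format may differ.
example_records = [
    {
        "type": "multiple_choice",
        "category": "spatial_reasoning",
        "frames": ["clip_0142/frame_07.jpg"],
        "question": "Which object is closest to the gripper?",
        "options": ["A. mug", "B. sponge", "C. drawer handle", "D. apple", "E. towel"],
        "answer": "B",
    },
    {
        "type": "sorting",
        "category": "planning",
        "frames": ["clip_0311/frame_02.jpg"],
        "question": "Order the steps needed to place the bottle in the bin.",
        "options": ["grasp bottle", "move above bin", "open gripper", "approach bottle"],
        "answer": ["approach bottle", "grasp bottle", "move above bin", "open gripper"],
    },
    {
        "type": "counting",
        "category": "perception",
        "frames": ["clip_0087/frame_15.jpg"],
        "question": "How many graspable objects are on the table?",
        "answer": 4,
    },
]
```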

Experimental Results

We evaluate a range of proprietary and open-source VLMs. Even state-of-the-art models fall well short of ceiling performance, especially on planning and spatial reasoning, underscoring the challenges that remain in embodied understanding.

Model                | Perception | Planning | Prediction | Spatial Reasoning | ERQA+ (All) | ERQA (All)
Gemini-3-Pro-preview | 53.5       | 49.0     | 74.0       | 59.1              | 57.3        | 66.0
Gemini-2.5-Pro       | 33.6       | 37.0     | 54.0       | 33.3              | 35.1        | 57.3
Gemini-2.5-Flash     | 32.3       | 22.0     | 48.0       | 26.6              | 28.9        | 53.3
GPT-5                | 42.4       | 38.0     | 56.0       | 46.7              | 45.0        | 59.3
GPT-5-Mini           | 34.1       | 22.0     | 40.0       | 39.3              | 35.8        | 53.8
Qwen3-VL-235b-a22b   | 33.2       | 34.0     | 48.0       | 32.8              | 34.0        | 49.5
GPT-5-Nano           | 18.9       | 12.0     | 18.0       | 21.2              | 19.3        | 45.8
Qwen3-VL-8B          | 23.5       | 14.0     | 32.0       | 21.0              | 21.5        | 41.8
Gemma3-12B           | 15.7       | 8.0      | 14.0       | 13.9              | 13.6        | 36.8
Phi-4-Multimodal     | 18.4       | 11.0     | 14.0       | 14.3              | 15.0        | 36.0

All numbers are accuracy (%); higher is better. ERQA+ exposes larger gaps between models than the original ERQA.
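
For readers reproducing numbers like those above, a minimal evaluation-loop sketch follows. It is not the official evaluation code: query_vlm is a stand-in for whichever model API is under test, and the exact-match scoring shown is a naive simplification (real scoring must handle sorting, matching, counting, and open answers differently).

```python
from collections import defaultdict

def query_vlm(model: str, frames: list[str], question: str, options=None) -> str:
    """Placeholder: send image frames plus a question to a VLM and return its answer."""
    raise NotImplementedError("replace with the API call for the model under test")

def evaluate(model: str, samples: list[dict]) -> dict[str, float]:
    """Compute overall and per-category accuracy (%) over a list of question records."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for s in samples:
        pred = query_vlm(model, s["frames"], s["question"], s.get("options"))
        ok = pred.strip() == str(s["answer"]).strip()  # naive exact-match scoring
        for key in ("all", s["category"]):
            total[key] += 1
            correct[key] += int(ok)
    return {k: 100.0 * correct[k] / total[k] for k in total}

# Usage: evaluate("some-model-id", samples) returns e.g.
# {"all": 45.0, "perception": 42.4, "planning": 38.0, ...}
```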

Conclusion

We introduce ERQA+, a new benchmark for embodied reasoning that addresses several limitations of the earlier ERQA benchmark. Our evaluation shows magnified gaps between smaller and larger models, as well as substantial remaining headroom in embodied reasoning. We hope this resource helps guide the development of future embodied vision-language models toward improved spatial reasoning, planning, and event understanding.

BibTeX

@misc{baai2025erqaplus,
  title={ERQA+: An Enhanced Benchmark on Embodied Reasoning},
  author={BAAI FlagEval Team},
  year={2025},
  howpublished={\url{https://flageval-baai.github.io/ERQA-Plus-page/}},
  note={Preprint}
}