TL;DR

We introduce ERQA+, a new embodied reasoning benchmark that complements ERQA (from the Gemini Robotics team) along several dimensions:

  • Egocentric scenes: ERQA+ targets embodied perspectives by focusing on first-person, egocentric scenes.
  • Reduced contamination: ERQA+ is newly created through manual annotation of recent robotics videos rather than reusing earlier VQA datasets.
  • Extended taxonomy: ERQA+ offers a finer-grained taxonomy that covers additional embodied reasoning skills.
  • Question types: ERQA+ spans multiple question types and increases the number of options in multiple-choice questions.
  • Difficulty: multi-stage filtering removes trivial or overly easy samples to keep evaluation challenging.
Figure: Example questions from ERQA and ERQA+, with prompts spanning the perception, prediction, spatial reasoning, and planning tasks included in ERQA+.

Why ERQA+

Rapid progress in large language and vision-language models has enabled robotic agents that reason about the world, yet evaluating nuanced embodied capabilities remains hard. Existing benchmarks emphasize text-centric tasks or reuse legacy data that does not capture egocentric perspectives, leading to data leakage and shallow reasoning shortcuts.

ERQA+ addresses these gaps with a newly collected benchmark dedicated to embodied reasoning. Built in the spirit of the original ERQA yet significantly extended, ERQA+ delivers harder questions, a more granular taxonomy, and contamination-resistant samples for stress-testing modern VLMs.

What's New

Extended Taxonomy

A fine-grained taxonomy of embodied reasoning skills, covering planning, prediction, perception, and spatial reasoning with detailed subcategories.

Egocentric Focus

All questions are annotated on egocentric video frames, grounding evaluation in first-person robot views instead of generic internet imagery.

Stricter Filtering

Multi-round quality control with LLM-only checks, sparse caption filters, and category-aware balancing to eliminate shortcuts.
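
One plausible reading of the "LLM-only checks" above is a blind, text-only shortcut filter: a question is discarded if a language model can answer it reliably without seeing the image. The sketch below assumes that interpretation; the function and field names (e.g. ask_llm_text_only, passes_shortcut_filter) are hypothetical placeholders, not the actual ERQA+ pipeline.

```python
import json

def ask_llm_text_only(question: str, options: list[str]) -> str:
    """Placeholder for a text-only LLM call; returns the model's chosen option."""
    raise NotImplementedError("wire this to your preferred chat-completion API")

def passes_shortcut_filter(sample: dict, n_trials: int = 3) -> bool:
    """Keep a sample only if a blind (image-free) model fails to guess it reliably."""
    correct = 0
    for _ in range(n_trials):
        guess = ask_llm_text_only(sample["question"], sample["options"])
        if guess.strip() == sample["answer"]:
            correct += 1
    # Reject samples the blind model answers correctly in a majority of trials.
    return correct <= n_trials // 2

def filter_dataset(path: str) -> list[dict]:
    """Load a JSONL file of candidate questions and drop text-only shortcuts."""
    with open(path) as f:
        samples = [json.loads(line) for line in f]
    return [s for s in samples if passes_shortcut_filter(s)]
```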

Benchmark Statistics

Metric             | Value
Total questions    | 800
Multiple-choice    | 283 (35.4%)
Sorting            | 214 (26.8%)
Matching           | 101 (12.6%)
Counting & numeric | 121 (15.1%)
Composite-judgment | 64 (8.0%)
Open               | 17 (2.1%)
Figure: Distribution of task categories across the perception, planning, prediction, and spatial reasoning disciplines.
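
As an illustration of how the question types above could be represented, here are a few made-up records. This is not the released ERQA+ schema; every field name, file path, and value is hypothetical.

```python
# Illustrative records only; the actual ERQA+ data format may differ.
example_records = [
    {
        "type": "multiple_choice",
        "category": "spatial_reasoning",
        "frames": ["clip_0142/frame_07.jpg"],
        "question": "Which object is closest to the gripper?",
        "options": ["A. mug", "B. sponge", "C. drawer handle", "D. apple", "E. towel"],
        "answer": "B",
    },
    {
        "type": "sorting",
        "category": "planning",
        "frames": ["clip_0311/frame_02.jpg"],
        "question": "Order the steps needed to place the bottle in the bin.",
        "options": ["grasp bottle", "move above bin", "open gripper", "approach bottle"],
        "answer": ["approach bottle", "grasp bottle", "move above bin", "open gripper"],
    },
    {
        "type": "counting",
        "category": "perception",
        "frames": ["clip_0087/frame_15.jpg"],
        "question": "How many graspable objects are on the table?",
        "answer": 4,
    },
]
```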

Experimental Results

We evaluate a range of proprietary and open-source VLMs. Even state-of-the-art models fall well short of ceiling performance, especially on planning and spatial reasoning, underscoring the challenges that remain in embodied understanding.

Model                | Perception | Planning | Prediction | Spatial Reasoning | ERQA+ (All) | ERQA (All)
Gemini-3-Pro-preview | 53.5       | 49.0     | 74.0       | 59.1              | 57.3        | 66.0
Gemini-2.5-Pro       | 33.6       | 37.0     | 54.0       | 33.3              | 35.1        | 57.3
Gemini-2.5-Flash     | 32.3       | 22.0     | 48.0       | 26.6              | 28.9        | 53.3
GPT-5                | 42.4       | 38.0     | 56.0       | 46.7              | 45.0        | 59.3
GPT-5-Mini           | 34.1       | 22.0     | 40.0       | 39.3              | 35.8        | 53.8
Qwen3-VL-235b-a22b   | 33.2       | 34.0     | 48.0       | 32.8              | 34.0        | 49.5
GPT-5-Nano           | 18.9       | 12.0     | 18.0       | 21.2              | 19.3        | 45.8
Qwen3-VL-8B          | 23.5       | 14.0     | 32.0       | 21.0              | 21.5        | 41.8
Gemma3-12B           | 15.7       | 8.0      | 14.0       | 13.9              | 13.6        | 36.8
Phi-4-Multimodal     | 18.4       | 11.0     | 14.0       | 14.3              | 15.0        | 36.0

All numbers are accuracy (%); higher is better. ERQA+ exposes larger gaps between models than the original ERQA.
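
For readers reproducing numbers like those above, a minimal evaluation-loop sketch follows. It is not the official evaluation code: query_vlm is a stand-in for whichever model API is under test, and the exact-match scoring shown is a naive simplification (real scoring must handle sorting, matching, counting, and open answers differently).

```python
from collections import defaultdict

def query_vlm(model: str, frames: list[str], question: str, options=None) -> str:
    """Placeholder: send image frames plus a question to a VLM and return its answer."""
    raise NotImplementedError("replace with the API call for the model under test")

def evaluate(model: str, samples: list[dict]) -> dict[str, float]:
    """Compute overall and per-category accuracy (%) over a list of question records."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for s in samples:
        pred = query_vlm(model, s["frames"], s["question"], s.get("options"))
        ok = pred.strip() == str(s["answer"]).strip()  # naive exact-match scoring
        for key in ("all", s["category"]):
            total[key] += 1
            correct[key] += int(ok)
    return {k: 100.0 * correct[k] / total[k] for k in total}

# Usage: evaluate("some-model-id", samples) returns e.g.
# {"all": 45.0, "perception": 42.4, "planning": 38.0, ...}
```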

Conclusion

We introduce ERQA+, a new benchmark for embodied reasoning that addresses several limitations of the earlier ERQA benchmark. Our evaluation shows magnified gaps between smaller and larger models, as well as substantial remaining headroom in embodied reasoning. We hope this resource helps guide the development of future embodied vision-language models toward improved spatial reasoning, planning, and event understanding.

BibTeX

@misc{baai2025erqaplus,
  title={ERQA+: An Enhanced Benchmark on Embodied Reasoning},
  author={BAAI FlagEval Team},
  year={2025},
  howpublished={\url{https://flageval-baai.github.io/ERQA-Plus-page/}},
  note={Preprint}
}