Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT

Yesheng Liu1,2,3 Hao Li3,4 Haiyu Xu3,5 Baoqi Pei6 Jiahao Wang1,2,3 Mingxuan Zhao3,5 Jing-Shu Zheng3 Zheqi He3 JG Yao3 Bowen Qin3 Xi Yang3 Jiajun Zhang3
1 Institute of Automation, CAS   2 School of Artificial Intelligence, UCAS   3 BAAI FlagEval Team   4 BUAA   5 PKU   6 ZJU  
ReVeL framework overview

Illustration of MCQA fragility. The example (left) shows an unfaithful reasoning chain that eliminates distractors incorrectly yet still produces the correct final answer, yielding a positive reward signal that, when used in reinforcement learning, further amplifies shortcut behavior (top right). This shortcut behavior widens the gap between MCQA and OpenQA performance, motivating ReVeL, which aligns evaluation and training with reliable OpenQA.

Abstract

Multiple-choice question answering (MCQA) has been a popular format for both evaluation and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simple, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes accuracy metrics unreliable indicators of real capability and encourages explicit or implicit answer-guessing behavior during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions by answer type and applies a dedicated rewriting and verification scheme to each. For RFT, we convert 20k MCQA examples and use GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency.
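As a rough sketch of the rewrite-and-verify idea described above, the snippet below shows how a ReVeL-style pipeline might categorize a question by answer type, rewrite it into open form, and verify free-form predictions. The answer-type categories, prompts, and the call_llm helper are hypothetical placeholders for illustration, not the released implementation.

# Hypothetical sketch of a ReVeL-style rewrite-and-verify flow.
# call_llm is a placeholder for whatever LLM client is available; prompts and
# answer-type categories are illustrative, not the paper's exact ones.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; returns the model's text response."""
    raise NotImplementedError

def categorize_answer_type(question: str, gold: str) -> str:
    """Classify the gold answer so a suitable rewriting/verification scheme can be chosen."""
    prompt = (
        "Classify the answer type of this QA pair as one of: "
        "number, entity, yes/no, free-form.\n"
        f"Question: {question}\nAnswer: {gold}\nType:"
    )
    return call_llm(prompt).strip().lower()

def rewrite_to_openqa(question: str, options: list[str], gold: str) -> dict:
    """Rewrite an MCQ into an open-form question whose answer stays verifiable."""
    answer_type = categorize_answer_type(question, gold)
    prompt = (
        "Rewrite the multiple-choice question below as an open-ended question "
        "that does not mention any options, keeping the same gold answer.\n"
        f"Question: {question}\nOptions: {options}\nGold answer: {gold}"
    )
    open_question = call_llm(prompt).strip()
    return {"question": open_question, "gold": gold, "type": answer_type}

def verify_answer(prediction: str, gold: str, answer_type: str) -> bool:
    """Verify a free-form prediction; constrained types can skip the LLM judge."""
    if answer_type in {"number", "yes/no"}:
        return prediction.strip().lower() == gold.strip().lower()
    verdict = call_llm(
        f"Does the prediction '{prediction}' convey the same answer as '{gold}'? "
        "Reply yes or no."
    )
    return verdict.strip().lower().startswith("yes")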

Fragility of MCQA

1. Quantifying the non-robustness of MCQA

We find that evaluation via MCQA lacks robustness to trivial modifications of the options:

add option results

Models exploit option cues and positional patterns rather than retrieving or reasoning from actual knowledge.

benchmarks results

When options are removed or replaced by “None-of-the-Above”, performance collapses, and models often reason correctly but still choose the wrong option, exposing logical inconsistency.
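For concreteness, here is a minimal sketch of the kind of option perturbation probed above: appending a trivial distractor, shuffling option positions, or replacing the correct option with "None of the Above". The helper names and prompt format are assumptions for illustration, not the exact evaluation harness used in the paper.

import random
import string

def perturb_mcq(question: str, options: list[str], correct_idx: int, mode: str):
    """Build a perturbed MCQ to probe option and position shortcuts.

    mode: 'add_distractor' | 'shuffle' | 'none_of_the_above'
    Returns (options, correct_idx) after perturbation.
    """
    opts = list(options)
    if mode == "add_distractor":
        opts.append("An unrelated distractor")   # trivial extra option
        return opts, correct_idx
    if mode == "shuffle":
        order = list(range(len(opts)))
        random.shuffle(order)                    # break positional patterns
        return [opts[i] for i in order], order.index(correct_idx)
    if mode == "none_of_the_above":
        # The true answer is no longer listed, so "None of the above"
        # (at the same index) becomes the correct choice.
        opts[correct_idx] = "None of the above"
        return opts, correct_idx
    raise ValueError(f"unknown mode: {mode}")

def format_prompt(question: str, options: list[str]) -> str:
    """Render the (possibly perturbed) MCQ as a letter-choice prompt."""
    letters = string.ascii_uppercase
    lines = [question] + [f"{letters[i]}. {o}" for i, o in enumerate(options)]
    return "\n".join(lines) + "\nAnswer with the option letter."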

2. RFT on MCQA hurts open-ended QA

Reinforcement fine-tuning (RFT) on MCQA improves MC accuracy but harms open-ended QA, widening the MCQA–OpenQA gap.

RFT on MCQA results

Impact of RFT on ViRL MCQA data. MCQ = multiple-choice benchmark score; Open = open-ended benchmark score. ∆ denotes the inflation gap (MCQ–Open). RFT on ViRL improves MCQ scores but enlarges ∆, indicating reinforced shortcut behavior.
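To make the contrast concrete, the sketch below shows one plausible way the two reward signals could look in a GRPO setup: a binary letter-match reward for MCQA rollouts versus a verification-based reward for rewritten open-form questions. The answer-letter parser and the verify_fn judge hook are illustrative assumptions, not the paper's exact reward code.

import re
from typing import Callable

def extract_letter(response: str) -> str | None:
    """Pull the final option letter out of a model response (illustrative parser)."""
    letters = re.findall(r"\b([A-E])\b", response)
    return letters[-1] if letters else None

def mcq_reward(response: str, gold_letter: str) -> float:
    """Letter-match reward: satisfiable by elimination or positional guessing."""
    return 1.0 if extract_letter(response) == gold_letter else 0.0

def openqa_reward(response: str, gold: str,
                  verify_fn: Callable[[str, str], bool]) -> float:
    """Verification-based reward for rewritten open-form questions.

    verify_fn stands in for exact-match or LLM-judge verification; since the
    open-form prompt contains no options, this reward cannot be earned by
    exploiting option cues.
    """
    return 1.0 if verify_fn(response, gold) else 0.0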

Datasets rewritten by ReVeL improve both MCQ and Open performance

Open-ended accuracy improves on every benchmark while MCQA scores remain competitive, and models trained with ReVeL data achieve a higher overall score. These results indicate that verifiable OpenQA aligns better with transferable reasoning and real-world usage, improving both open-ended performance and the combined overall metric.

ReVeL examples

BibTeX

@article{liu2025ReVeL,
  title={Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training},
  author={Yesheng Liu and Hao Li and Haiyu Xu and Baoqi Pei and Jiahao Wang and Mingxuan Zhao and Jingshu Zheng and Zheqi He and JG Yao and Bowen Qin and Xi Yang and Jiajun Zhang},
  year={2025},
  eprint={2511.17405},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.17405},
}