Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
Illustration of MCQA fragility. The example (left) shows an unfaithful reasoning chain that eliminates distractors incorrectly yet produces a correct final answer, yielding a positive reward signal that, when used in reinforcement learning, further amplifies shortcut behavior (top right). This shortcut behavior widens the gap between MCQA and OpenQA, and motivates ReVeL, which aligns evaluation and training with reliable OpenQA.
Abstract
Multiple-choice question answering (MCQA) has been a popular format for both evaluation and reinforcement fine-tuning (RFT) of modern multimodal language models, since its constrained output format allows simple, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes accuracy metrics unreliable indicators of real capability and encourages explicit or implicit answer-guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions by answer type and applies a corresponding rewriting and verification scheme to each category. For RFT, we convert 20k MCQA examples and use GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency.
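To make the rewrite-and-verify idea concrete, here is a minimal sketch of the pipeline described above. The prompts, the answer-type taxonomy, and the `call_llm` helper are illustrative assumptions for exposition, not the exact implementation used in the paper.

```python
# Sketch of a ReVeL-style pipeline: classify the gold answer's type, rewrite the
# MCQ as an open-form question, then verify predictions with a scheme suited to
# that type. `call_llm` stands in for any chat-completion API.
from dataclasses import dataclass


@dataclass
class MCQAExample:
    question: str
    options: dict[str, str]   # e.g. {"A": "Eiffel Tower", "B": "Big Ben", ...}
    answer_key: str           # e.g. "A"


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; returns the model's text reply."""
    raise NotImplementedError


def rewrite_to_openqa(ex: MCQAExample) -> dict:
    """Rewrite one MCQA item into a verifiable open-form item."""
    gold = ex.options[ex.answer_key]
    answer_type = call_llm(
        "Classify this answer as one of: number, short phrase, free-form.\n"
        f"Answer: {gold}"
    ).strip().lower()
    open_question = call_llm(
        "Rewrite the following multiple-choice question as an open-ended question "
        "that can be answered without seeing any options, preserving its meaning.\n"
        f"Question: {ex.question}"
    ).strip()
    return {"question": open_question, "gold": gold, "answer_type": answer_type}


def verify(prediction: str, gold: str, answer_type: str) -> bool:
    """Deterministic check for constrained answer types; LLM judge for free-form."""
    if answer_type in {"number", "short phrase"}:
        return prediction.strip().lower() == gold.strip().lower()
    verdict = call_llm(
        "Does the prediction convey the same answer as the reference? Reply yes or no.\n"
        f"Reference: {gold}\nPrediction: {prediction}"
    )
    return verdict.strip().lower().startswith("yes")
```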
Fragility of MCQA
1. Quantifying the non-robustness of MCQA
We find that evaluation via MCQA lacks robustness to trivial modifications of the options:
Models exploit option cues and positional patterns rather than retrieving or reasoning over actual knowledge.
When options are removed or replaced by “None-of-the-Above”, performance collapses; models often reason correctly yet still choose the wrong option, exposing logical inconsistency (see the probe sketch below).
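The sketch below builds the perturbed prompts used in such a robustness check: the original multiple-choice item, a variant with the options removed, and a variant where the correct option is replaced by “None of the above”. The prompt templates are assumptions for illustration, not the paper's exact wording.

```python
# Option-perturbation probes for one MCQA item.
def make_probes(question: str, options: dict[str, str], answer_key: str) -> dict[str, str]:
    option_block = "\n".join(f"{k}. {v}" for k, v in sorted(options.items()))

    # 1) Original multiple-choice prompt.
    original = f"{question}\n{option_block}\nAnswer with the option letter."

    # 2) Options removed: the model must produce the answer content itself.
    no_options = f"{question}\nAnswer the question directly."

    # 3) Correct option replaced with "None of the above": the original answer
    #    content is absent, so positional or surface-cue shortcuts no longer work.
    nota_options = dict(options)
    nota_options[answer_key] = "None of the above"
    nota_block = "\n".join(f"{k}. {v}" for k, v in sorted(nota_options.items()))
    none_of_the_above = f"{question}\n{nota_block}\nAnswer with the option letter."

    return {
        "original": original,
        "no_options": no_options,
        "none_of_the_above": none_of_the_above,
    }


# Example usage with a hypothetical item:
probes = make_probes(
    "Which landmark is shown in the image?",
    {"A": "Eiffel Tower", "B": "Big Ben", "C": "Colosseum", "D": "Statue of Liberty"},
    answer_key="A",
)
```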
2. RFT on MCQA hurts open-ended QA
Reinforcement fine-tuning (RFT) on MCQA improves MC accuracy but harms open-ended QA, widening the MCQA–OpenQA gap.
Impact of RFT on ViRL MCQA data. MCQ = multiple-choice benchmark score; Open = open-ended benchmark score. ∆ denotes the inflation gap (MCQ–Open). RFT on ViRL improves MCQ scores but enlarges ∆, indicating reinforced shortcut behavior.
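To see why MCQA rewards can reinforce shortcuts, the sketch below contrasts the kind of rule-based reward each setup would give a GRPO rollout: a letter-match reward pays a lucky or shortcut guess as much as a reasoned answer, while a verifiable OpenQA reward requires the answer content itself to check out. The extraction regex and helper names are illustrative assumptions, not the paper's exact reward implementation.

```python
import re


def extract_choice(response: str) -> str | None:
    """Take the last standalone A-D letter in the response as the chosen option."""
    letters = re.findall(r"\b([A-D])\b", response)
    return letters[-1] if letters else None


def mcqa_reward(response: str, gold_letter: str) -> float:
    """Letter-match reward: a correct guess scores the same as a reasoned answer."""
    return 1.0 if extract_choice(response) == gold_letter else 0.0


def openqa_reward(response: str, gold: str, verify) -> float:
    """Verifiable OpenQA reward: the produced answer content must match the
    reference, e.g. via a verifier like the one sketched above."""
    answer = response.split("Answer:")[-1].strip()
    return 1.0 if verify(answer, gold) else 0.0
```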
Datasets rewritten by ReVeL improve both MCQ and Open performance
Open-ended accuracy improves on every benchmark while MCQA scores remain competitive, and models trained with ReVeL data achieve a higher overall score. These results indicate that verifiable OpenQA aligns better with transferable reasoning and real-world usage, improving both open-ended performance and the combined overall metric.
BibTeX
@article{liu2025ReVeL,
title={Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training},
author={Yesheng Liu and Hao Li and Haiyu Xu and Baoqi Pei and Jiahao Wang and Mingxuan Zhao and Jingshu Zheng and Zheqi He and JG Yao and Bowen Qin and Xi Yang and Jiajun Zhang},
year={2025},
eprint={2511.17405},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.17405},
}