Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT

Yesheng Liu1,2,3 Hao Li3,4 Haiyu Xu3,5 Baoqi Pei6 Jiahao Wang1,2,3 Mingxuan Zhao3,5 Jing-Shu Zheng3 Zheqi He3 JG Yao3 Bowen Qin3 Xi Yang3 Jiajun Zhang3
1 Institute of Automation, CAS   2 School of Artificial Intelligence, UCAS   3 BAAI FlagEval Team   4 BUAA   5 PKU   6 ZJU  
ReVeL framework overview

Illustration of MCQA fragility. The example (left) shows an unfaithful reasoning chain that eliminates distractors incorrectly yet still produces the correct final answer, yielding a positive reward signal that, when used in reinforcement learning, further amplifies shortcut behavior (top right). This shortcut behavior widens the gap between MCQA and OpenQA performance, motivating ReVeL, which aligns evaluation and training with reliable OpenQA.

Abstract

Multiple-choice question answering (MCQA) has been a popular format for both evaluation and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simple, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes accuracy metrics unreliable indicators of real capability and encourages explicit or implicit answer-guessing behavior during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions by answer type and applies a dedicated rewriting and verification scheme to each. For RFT, we convert 20k MCQA examples and use GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency.
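As a rough sketch of the rewrite-and-verify idea described above, the snippet below shows how a ReVeL-style pipeline might categorize a question by answer type, rewrite it into open form, and verify free-form predictions. The answer-type categories, prompts, and the call_llm helper are hypothetical placeholders for illustration, not the released implementation.

# Hypothetical sketch of a ReVeL-style rewrite-and-verify flow.
# call_llm is a placeholder for whatever LLM client is available; prompts and
# answer-type categories are illustrative, not the paper's exact ones.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; returns the model's text response."""
    raise NotImplementedError

def categorize_answer_type(question: str, gold: str) -> str:
    """Classify the gold answer so a suitable rewriting/verification scheme can be chosen."""
    prompt = (
        "Classify the answer type of this QA pair as one of: "
        "number, entity, yes/no, free-form.\n"
        f"Question: {question}\nAnswer: {gold}\nType:"
    )
    return call_llm(prompt).strip().lower()

def rewrite_to_openqa(question: str, options: list[str], gold: str) -> dict:
    """Rewrite an MCQ into an open-form question whose answer stays verifiable."""
    answer_type = categorize_answer_type(question, gold)
    prompt = (
        "Rewrite the multiple-choice question below as an open-ended question "
        "that does not mention any options, keeping the same gold answer.\n"
        f"Question: {question}\nOptions: {options}\nGold answer: {gold}"
    )
    open_question = call_llm(prompt).strip()
    return {"question": open_question, "gold": gold, "type": answer_type}

def verify_answer(prediction: str, gold: str, answer_type: str) -> bool:
    """Verify a free-form prediction; constrained types can skip the LLM judge."""
    if answer_type in {"number", "yes/no"}:
        return prediction.strip().lower() == gold.strip().lower()
    verdict = call_llm(
        f"Does the prediction '{prediction}' convey the same answer as '{gold}'? "
        "Reply yes or no."
    )
    return verdict.strip().lower().startswith("yes")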

Fragility of MCQA

1. Quantifying the non-robustness of MCQA

We find that evaluation via MCQA lacks robustness to trivial modifications of the options:

add option results

Models exploit option cues and positional patterns rather than retrieving or reasoning from actual knowledge.

benchmarks results

When options are removed or replaced by “None-of-the-Above”, performance collapses, and models often reason correctly but still choose the wrong option, exposing logical inconsistency.
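For concreteness, here is a minimal sketch of the kind of option perturbation probed above: appending a trivial distractor, shuffling option positions, or replacing the correct option with "None of the Above". The helper names and prompt format are assumptions for illustration, not the exact evaluation harness used in the paper.

import random
import string

def perturb_mcq(question: str, options: list[str], correct_idx: int, mode: str):
    """Build a perturbed MCQ to probe option and position shortcuts.

    mode: 'add_distractor' | 'shuffle' | 'none_of_the_above'
    Returns (options, correct_idx) after perturbation.
    """
    opts = list(options)
    if mode == "add_distractor":
        opts.append("An unrelated distractor")   # trivial extra option
        return opts, correct_idx
    if mode == "shuffle":
        order = list(range(len(opts)))
        random.shuffle(order)                    # break positional patterns
        return [opts[i] for i in order], order.index(correct_idx)
    if mode == "none_of_the_above":
        # The true answer is no longer listed, so "None of the above"
        # (at the same index) becomes the correct choice.
        opts[correct_idx] = "None of the above"
        return opts, correct_idx
    raise ValueError(f"unknown mode: {mode}")

def format_prompt(question: str, options: list[str]) -> str:
    """Render the (possibly perturbed) MCQ as a letter-choice prompt."""
    letters = string.ascii_uppercase
    lines = [question] + [f"{letters[i]}. {o}" for i, o in enumerate(options)]
    return "\n".join(lines) + "\nAnswer with the option letter."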

2. RFT on MCQA hurts open-ended QA

Reinforcement fine-tuning (RFT) on MCQA improves MC accuracy but harms open-ended QA, widening the MCQA–OpenQA gap.

RFT on MCQA results

Impact of RFT on ViRL MCQA data. MCQ = multiple-choice benchmark score; Open = open-ended benchmark score. ∆ denotes the inflation gap (MCQ–Open). RFT on ViRL improves MCQ scores but enlarges ∆, indicating reinforced shortcut behavior.
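To make the contrast concrete, the sketch below shows one plausible way the two reward signals could look in a GRPO setup: a binary letter-match reward for MCQA rollouts versus a verification-based reward for rewritten open-form questions. The answer-letter parser and the verify_fn judge hook are illustrative assumptions, not the paper's exact reward code.

import re
from typing import Callable

def extract_letter(response: str) -> str | None:
    """Pull the final option letter out of a model response (illustrative parser)."""
    letters = re.findall(r"\b([A-E])\b", response)
    return letters[-1] if letters else None

def mcq_reward(response: str, gold_letter: str) -> float:
    """Letter-match reward: satisfiable by elimination or positional guessing."""
    return 1.0 if extract_letter(response) == gold_letter else 0.0

def openqa_reward(response: str, gold: str,
                  verify_fn: Callable[[str, str], bool]) -> float:
    """Verification-based reward for rewritten open-form questions.

    verify_fn stands in for exact-match or LLM-judge verification; since the
    open-form prompt contains no options, this reward cannot be earned by
    exploiting option cues.
    """
    return 1.0 if verify_fn(response, gold) else 0.0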

Datasets rewritten by ReVeL improve both MCQ and Open performance

Open-ended accuracy improves on every benchmark while MCQA scores remain competitive, and models trained with ReVeL data achieve a higher overall score. These results indicate that verifiable OpenQA aligns better with transferable reasoning and real-world usage, improving both open-ended performance and the combined overall metric.

ReVeL examples

BibTeX

@article{liu2025ReVeL,
  title={Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training},
  author={Yesheng Liu and Hao Li and Haiyu Xu and Baoqi Pei and Jiahao Wang and Mingxuan Zhao and Jingshu Zheng and Zheqi He and JG Yao and Bowen Qin and Xi Yang and Jiajun Zhang},
  year={2025},
  eprint={2511.17405},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.17405},
}