A Status Check on Current Vision-Language Models in Text Recognition and Understanding

BAAI FlagEval Team

Abstract

Recent vision-language models (VLMs) have demonstrated impressive performance in text recognition and understanding, as shown by their scores on a number of text-centric benchmarks. In this study, we take a closer look at this success. Following popular relevant benchmarks, we conduct further analysis on re-collected and edited data. We find that while modern VLMs indeed show strong text recognition and understanding capabilities, this strength may be slightly over-estimated for some models, with a risk of benchmark saturation and overfitting. We discuss the implications and present TRUE, our new benchmark for Text Recognition and Understanding Evaluation, designed with regular updates in mind. The first version of the benchmark confirms the clear advantage of the most recently released top-tier VLMs, while also showing room for further improvement. We hope our analysis and benchmark updates can contribute to the development and evaluation of relevant progress in the near future.

A status check

We conduct our analysis using popular benchmarks as references. For text recognition, we attempt a replication of the recognition subsets in OCRBench, strictly following the data collection process of each source dataset. The performance of current top-tier VLMs in Figure 1 suggests distributional overfitting on text recognition tasks.

Figure 1: Heatmap of the text recognition performance of different models on the subsets of OCRBench and their re-collected counterparts in TRUE.
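For reference, the sketch below illustrates one way such a re-collected recognition set can be scored; OCRBench-style recognition is commonly judged by checking whether the normalized ground-truth string appears in the model output. The item fields and helper names here are illustrative, not the exact evaluation harness.

    import re

    def normalize(text: str) -> str:
        # Lowercase and keep only letters/digits for a lenient comparison.
        return re.sub(r"[^a-z0-9]", "", text.lower())

    def recognition_hit(answer: str, prediction: str) -> bool:
        # OCRBench-style check: the normalized answer must appear in the output.
        return normalize(answer) in normalize(prediction)

    def subset_accuracy(items, predictions):
        # items: dicts with an "answer" field; predictions: parallel list of model outputs.
        hits = sum(recognition_hit(it["answer"], p) for it, p in zip(items, predictions))
        return 100.0 * hits / max(len(items), 1)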

For text understanding, we apply minimal perturbations to DocVQA and TextVQA images: only the region containing the target answer is changed, and any other visual element semantically related to the old answer is modified accordingly. Two edited examples are shown in Figure 2:

Figure 2: Two edited examples from TextVQA and DocVQA.

Performance drops on the edited images show that:

(1) Accuracy drops on both DocVQA and TextVQA.

(2) Text recognition and understanding in a richer textual context (DocVQA) might be slightly more robust than recognizing text in scenes with very little textual context (TextVQA).

Figure 3: Accuracy drop on minimally edited images from TextVQA and DocVQA.
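The comparison behind Figure 3 can be reproduced with a simple paired evaluation: each original item is scored next to its minimally edited counterpart, and the gap between the two accuracies is the drop reported per dataset. The sketch below assumes a generic ask_model callable and an is_correct scorer (e.g. exact match or ANLS); both names are placeholders rather than a released interface.

    def paired_accuracy_drop(pairs, ask_model, is_correct):
        # pairs: (original_item, edited_item) tuples, each item a dict with
        # "image", "question", and "answer" (the edited item carries the new answer).
        orig_hits = edit_hits = 0
        for orig, edit in pairs:
            orig_hits += is_correct(orig["answer"], ask_model(orig["image"], orig["question"]))
            edit_hits += is_correct(edit["answer"], ask_model(edit["image"], edit["question"]))
        n = max(len(pairs), 1)
        orig_acc = 100.0 * orig_hits / n
        edit_acc = 100.0 * edit_hits / n
        return orig_acc, edit_acc, orig_acc - edit_acc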

Newly collected benchmark

Based on our preliminary findings, we assemble our re-collection of new data as the first batch of our new benchmark, the Text Recognition and Understanding Evaluation suite (TRUE). The first version includes 1,146 image-text pairs (with a subset of 573 forming a more challenging hard set), covering the following subsets (a data-loading sketch follows the list):

SceneOCR: General scene-text recognition
HW: Handwriting recognition on paper, boards, or digital pages
SceneVQA: Scene-text understanding (VQA)
DocumentVQA: Document understanding (VQA)
ChartInfo: Chart and infographics understanding (VQA)
Receipt VQA: Receipt understanding
Food VQA: Food-ingredient understanding on product packages
FB: Recognizing fake brands (recognition vs. language bias)
Book: Books on a bookshelf (naturally rotated text)
Diet: Dietary VQA (understanding beyond ingredient extraction)
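As the data-loading sketch referenced above, the items could be stored one JSON record per line and grouped by subset, so that per-subset scores can be reported alongside the overall and hard-set numbers. The field names ("subset", "hard", etc.) are illustrative, not the released schema.

    import json
    from collections import defaultdict

    def load_true(path: str, hard_only: bool = False):
        # Group benchmark items by subset; optionally keep only the hard split.
        by_subset = defaultdict(list)
        with open(path, encoding="utf-8") as f:
            for line in f:
                item = json.loads(line)  # e.g. {"subset": "SceneOCR", "question": ..., "answer": ...}
                if hard_only and not item.get("hard", False):
                    continue
                by_subset[item["subset"]].append(item)
        return by_subset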

Example items from each subset:

Receipt understanding
Q: What is the total amount of this receipt? Answer this question using the text in the image directly. A: 31.91

Food ingredients understanding on product packages
Q: What is the value for Carbohydrates Per Serving? Answer this question using the text in the image directly. A: 3g

Fake brand
Q: Identify the brand shown in the image and provide your answer exactly as it appears. A: PolyStation

Books on the bookshelf
Q: What is the title of the book written by BELTING in the image? Answer this question using the text in the image directly. A: FACE AND MASK

SceneOCR
Q: Extract all text you can find in the image. A: ["GALLOP", "AIR"]

Handwriting recognition
Q: Extract all text content from the image. A: ["wednesday, nine of june 2010.", "Natalia"]

Scene-text understanding
Q: On what date is the festival scheduled featuring Gwada Mike? A: ["July", "22", "2023"]

Dietary VQA
Q: Is the product in the image nut-free? If the answer is No, please specify the questionable ingredients. Otherwise please answer 'Yes'. A: Yes

Document understanding
Q: What is the purpose of the attached handouts mentioned in the email? A: self study module

Chart and infographics understanding
Q: What is the combined retail sales share of IKEA worldwide in fiscal year 2019 for the categories of Living room, Children's IKEA, and IKEA food? A: 29%
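The examples above mix two answer formats: a single string for VQA-style items and a list of strings for full-image OCR items. One plausible way to score both, shown below, is a lenient containment match over a normalized prediction, where list answers require every target string to be present; the official scoring protocol may differ.

    import re

    def _norm(text: str) -> str:
        # Collapse whitespace and lowercase for a lenient comparison.
        return re.sub(r"\s+", " ", text.strip().lower())

    def score_item(answer, prediction: str) -> bool:
        # Single-string answers: containment match; list answers: every string must appear.
        targets = answer if isinstance(answer, list) else [answer]
        pred = _norm(prediction)
        return all(_norm(t) in pred for t in targets)

    # The SceneOCR example above expects both "GALLOP" and "AIR" to be read out.
    print(score_item(["GALLOP", "AIR"], "The image contains the words GALLOP and AIR."))  # True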

Model performance (%) on the hard subset of TRUE

Model | All | HW | SceneOCR | FakeBrands | BookVQA | DietVQA | Chart&Info | SceneVQA
gemini-2.5-pro-preview-03-25 | 81.6 | 88.2 | 68.4 | 67.9 | 95.4 | 68.8 | 83.6 | 80.3
gemini-2.0-pro-exp | 75.0 | 73.7 | 73.7 | 75.0 | 82.9 | 70.5 | 72.6 | 69.0
gemini-2.0-flash-exp | 75.0 | 89.5 | 54.4 | 67.9 | 91.1 | 65.2 | 72.6 | 69.0
gpt-4o-2024-11-20 | 66.9 | 60.5 | 70.2 | 57.1 | 78.7 | 60.7 | 54.8 | 64.8
claude-3-7-sonnet-20250219 | 51.7 | 60.5 | 35.1 | 28.6 | 38.5 | 61.6 | 75.3 | 62.0
Qwen2.5-VL-72B-Instruct | 64.9 | 76.3 | 70.2 | 57.1 | 55.8 | 68.8 | 72.6 | 66.2
InternVL2_5-78B | 60.9 | 50.0 | 45.6 | 50.0 | 72.4 | 63.4 | 57.5 | 54.9
Qwen2.5-VL-32B-Instruct | 59.9 | 57.9 | 49.1 | 35.7 | 50.0 | 65.2 | 80.8 | 73.2
Pixtral-Large-Instruct-2411 | 57.1 | 50.0 | 45.6 | 42.9 | 53.5 | 73.2 | 72.6 | 43.7
Mistral-Small-3.1-24B | 51.2 | 73.7 | 31.6 | 17.9 | 47.7 | 54.5 | 72.6 | 49.3
Qwen2.5-VL-7B-Instruct | 49.4 | 55.3 | 56.1 | 60.7 | 50.0 | 30.4 | 56.2 | 57.8
Molmo-72B-0924 | 47.9 | 36.8 | 38.6 | 35.7 | 51.9 | 52.7 | 46.6 | 50.7
Meta-Llama-3.2-90B-Vision | 41.5 | 21.1 | 54.4 | 21.4 | 55.7 | 28.6 | 49.3 | 31.0
Pixtral-12B-2409 | 41.0 | 39.5 | 24.6 | 28.6 | 33.3 | 56.2 | 67.1 | 28.2
MiniCPM-o-2_6 | 40.7 | 55.3 | 42.1 | 17.9 | 35.1 | 36.6 | 58.9 | 42.2
InternVL2_5-8B | 38.0 | 26.3 | 38.6 | 25.0 | 39.7 | 36.6 | 46.6 | 38.0
Molmo-7B-D-0924 | 32.8 | 18.4 | 40.4 | 35.7 | 34.2 | 23.2 | 39.7 | 38.0
Qwen2.5-VL-3B-Instruct | 32.7 | 57.9 | 28.1 | 25.0 | 29.3 | 28.6 | 37.0 | 36.6
llava-onevision-qwen2-72b | 31.8 | 10.5 | 42.1 | 21.4 | 0.6 | 62.5 | 50.7 | 47.9
llava-onevision-qwen2-7b | 20.4 | 13.2 | 26.3 | 17.9 | 4.6 | 36.6 | 26.0 | 28.2
Idefics3-8B-Llama3 | 16.6 | 5.3 | 3.5 | 7.1 | 17.2 | 22.3 | 21.9 | 21.1
Meta-Llama-3.2-11B-Vision | 15.9 | 0.0 | 3.5 | 7.1 | 39.1 | 2.7 | 15.1 | 2.8
Phi-4-multimodal-instruct | 15.4 | 2.6 | 10.5 | 14.3 | 3.5 | 20.5 | 35.6 | 26.8

BibTeX citation

    @misc{baaiflageval2025true,
      author       = "{BAAI FlagEval Team}",
      title        = "A Status Check on Current Vision-Language Models in Text Recognition and Understanding",
      year         = "2025",
      howpublished = "https://flageval-baai.github.io/TRUE/",
    }