Recent vision-language models (VLMs) have demonstrated impressive performance in text recognition and understanding, as reflected by their scores on a number of text-centric benchmarks. In this study, we take a closer look at this success. Using popular benchmarks as references, we conduct further analysis on re-collected and edited data. We find that while modern VLMs do show strong text recognition and understanding capabilities, this strength may be slightly over-estimated for some models, with a risk of benchmark saturation and overfitting. We discuss the implications and introduce TRUE, our new benchmark for Text Recognition and Understanding Evaluation, which we plan to update regularly. The first version of the benchmark confirms the clear advantage of the most recently released top-tier VLMs, while also revealing room for further improvement. We hope our analysis and benchmark updates will contribute to the development and evaluation of VLMs in the near future.
We conduct our analysis using popular benchmarks as references. For text recognition, we attempt to replicate the recognition subsets of OCRBench; in brief, we strictly follow the data collection process of the source dataset. As shown in Figure 1, the performance of current top-tier VLMs on the re-collected data suggests distributional overfitting on text recognition tasks.
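For context, recognition subsets of this kind are typically scored with a normalized containment check between the model output and the ground-truth string. The sketch below illustrates that kind of metric; it is a minimal, assumed scoring scheme for illustration, not the exact evaluation code behind the figures here.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits, a common
    normalization for OCR-style answer matching (an assumption here)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def recognition_hit(prediction: str, ground_truth: str) -> bool:
    """Count a hit if the normalized ground truth appears anywhere in the
    normalized prediction (OCRBench-style containment matching)."""
    return normalize(ground_truth) in normalize(prediction)

# Hypothetical predictions paired with ground-truth transcriptions.
pairs = [
    ("The sign reads 'OPEN 24 HOURS'", "open 24 hours"),
    ("I cannot read the text.", "exit"),
]
accuracy = sum(recognition_hit(p, g) for p, g in pairs) / len(pairs)
print(f"recognition accuracy: {accuracy:.1%}")  # -> 50.0%
```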
For text understanding, we apply minimal perturbations to DocVQA and TextVQA: only the image region containing the target answer is changed, and any other visual element semantically tied to the old answer is modified accordingly. Two edited examples are shown in Figure 2.
Performance drops on the edited images (quantified as sketched below) show that:
(1) Accuracy drops on both DocVQA and TextVQA.
(2) Text recognition and understanding in richer textual contexts (DocVQA) might be slightly more robust than recognizing text in scenes with very little textual context (TextVQA).
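The drop for a given model is simply its accuracy on the original images minus its accuracy on their edited counterparts. A minimal sketch, assuming hypothetical per-question correctness records keyed by question id:

```python
def accuracy_drop(original: dict[str, bool], edited: dict[str, bool]) -> float:
    """Accuracy on original images minus accuracy on their minimally edited
    counterparts, over the question ids shared by both sets."""
    shared = original.keys() & edited.keys()
    original_acc = sum(original[q] for q in shared) / len(shared)
    edited_acc = sum(edited[q] for q in shared) / len(shared)
    return original_acc - edited_acc

# Hypothetical per-question correctness for one model.
original = {"q1": True, "q2": True, "q3": False, "q4": True}
edited = {"q1": True, "q2": False, "q3": False, "q4": False}
print(f"accuracy drop: {accuracy_drop(original, edited):.1%}")  # -> 50.0%
```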
Based on our preliminary findings, we assemble our re-collected data into the first batch of our new benchmark, the Text Recognition and Understanding Evaluation suite (TRUE). The first version includes 1,146 image-text pairs (a 573-pair subset forms a more challenging hard set), covering the following subsets:
- SceneOCR: General scene-text recognition
- HW: Handwriting recognition on paper, boards, or digital pages
- SceneVQA: Scene-text understanding (VQA)
- DocumentVQA: Document understanding (VQA)
- ChartInfo: Chart and infographics understanding (VQA)
- Receipt VQA: Receipt understanding
- Food VQA: Food-ingredient understanding on product packages
- FB: Recognizing fake brands (recognition vs. language bias)
- Book: Books on a bookshelf (naturally rotated text)
- Diet: Dietary VQA (understanding beyond ingredient extraction)
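As a concrete illustration of how a suite like this can be organized and scored per subset, the sketch below shows one possible JSONL sample format and a minimal evaluation loop. The field names, file layout, and the `model.generate(image_path, prompt)` interface are assumptions for illustration, not the released TRUE format.

```python
import json
from dataclasses import dataclass

@dataclass
class TrueSample:
    """One image-text pair; field names are illustrative, not the released schema."""
    image_path: str
    subset: str          # e.g. "SceneOCR", "HW", "Diet"
    question: str        # instruction or question shown to the model
    answers: list[str]   # accepted ground-truth answers
    hard: bool           # whether the pair belongs to the 573-pair hard set

def load_samples(path: str) -> list[TrueSample]:
    """Read one JSON object per line (an assumed JSONL layout)."""
    with open(path, encoding="utf-8") as f:
        return [TrueSample(**json.loads(line)) for line in f]

def evaluate(model, samples: list[TrueSample]) -> dict[str, float]:
    """Per-subset accuracy for a model exposing an assumed
    generate(image_path, prompt) -> str interface."""
    hits: dict[str, list[int]] = {}
    for s in samples:
        prediction = model.generate(s.image_path, s.question).lower()
        correct = any(ans.lower() in prediction for ans in s.answers)
        hits.setdefault(s.subset, []).append(int(correct))
    return {subset: sum(v) / len(v) for subset, v in hits.items()}
```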
Model | All | HW | SceneOCR | FakeBrands | BookVQA | DietVQA | Chart&Info | SceneVQA |
---|---|---|---|---|---|---|---|---|
gemini-2.5-pro-preview-03-25 | 81.6 | 88.2 | 68.4 | 67.9 | 95.4 | 68.8 | 83.6 | 80.3 |
gemini-2.0-pro-exp | 75.0 | 73.7 | 73.7 | 75.0 | 82.9 | 70.5 | 72.6 | 69.0 |
gemini-2.0-flash-exp | 75.0 | 89.5 | 54.4 | 67.9 | 91.1 | 65.2 | 72.6 | 69.0 |
gpt-4o-2024-11-20 | 66.9 | 60.5 | 70.2 | 57.1 | 78.7 | 60.7 | 54.8 | 64.8 |
claude-3-7-sonnet-20250219 | 51.7 | 60.5 | 35.1 | 28.6 | 38.5 | 61.6 | 75.3 | 62.0 |
Qwen2.5-VL-72B-Instruct | 64.9 | 76.3 | 70.2 | 57.1 | 55.8 | 68.8 | 72.6 | 66.2 |
InternVL2_5-78B | 60.9 | 50.0 | 45.6 | 50.0 | 72.4 | 63.4 | 57.5 | 54.9 |
Qwen2.5-VL-32B-Instruct | 59.9 | 57.9 | 49.1 | 35.7 | 50.0 | 65.2 | 80.8 | 73.2 |
Pixtral-Large-Instruct-2411 | 57.1 | 50.0 | 45.6 | 42.9 | 53.5 | 73.2 | 72.6 | 43.7 |
Mistral-Small-3.1-24B | 51.2 | 73.7 | 31.6 | 17.9 | 47.7 | 54.5 | 72.6 | 49.3 |
Qwen2.5-VL-7B-Instruct | 49.4 | 55.3 | 56.1 | 60.7 | 50.0 | 30.4 | 56.2 | 57.8 |
Molmo-72B-0924 | 47.9 | 36.8 | 38.6 | 35.7 | 51.9 | 52.7 | 46.6 | 50.7 |
Meta-Llama-3.2-90B-Vision | 41.5 | 21.1 | 54.4 | 21.4 | 55.7 | 28.6 | 49.3 | 31.0 |
Pixtral-12B-2409 | 41.0 | 39.5 | 24.6 | 28.6 | 33.3 | 56.2 | 67.1 | 28.2 |
MiniCPM-o-2_6 | 40.7 | 55.3 | 42.1 | 17.9 | 35.1 | 36.6 | 58.9 | 42.2 |
InternVL2_5-8B | 38.0 | 26.3 | 38.6 | 25.0 | 39.7 | 36.6 | 46.6 | 38.0 |
Molmo-7B-D-0924 | 32.8 | 18.4 | 40.4 | 35.7 | 34.2 | 23.2 | 39.7 | 38.0 |
Qwen2.5-VL-3B-Instruct | 32.7 | 57.9 | 28.1 | 25.0 | 29.3 | 28.6 | 37.0 | 36.6 |
llava-onevision-qwen2-72b | 31.8 | 10.5 | 42.1 | 21.4 | 0.6 | 62.5 | 50.7 | 47.9 |
llava-onevision-qwen2-7b | 20.4 | 13.2 | 26.3 | 17.9 | 4.6 | 36.6 | 26.0 | 28.2 |
Idefics3-8B-Llama3 | 16.6 | 5.3 | 3.5 | 7.1 | 17.2 | 22.3 | 21.9 | 21.1 |
Meta-Llama-3.2-11B-Vision | 15.9 | 0.0 | 3.5 | 7.1 | 39.1 | 2.7 | 15.1 | 2.8 |
Phi-4-multimodal-instruct | 15.4 | 2.6 | 10.5 | 14.3 | 3.5 | 20.5 | 35.6 | 26.8 |
@misc{baaiflageval2025true,
author = "{BAAI FlagEval Team}",
title = "A Status Check on Current Vision-Language Models in Text Recognition and Understanding",
year = "2025",
howpublished = "https://flageval-baai.github.io/TRUE/",
}