The Landscape of Multimodal Evaluation Benchmarks

Introduction

With the rapid progress in large language models (LLMs), models that can process multimodal inputs have recently come to the forefront of the field. These models take both text and images as input, and sometimes other modalities as well, such as video or speech.

Multimodal models present unique evaluation challenges. In this blog post, we look at a few multimodal datasets that can be used to assess such models, mostly ones focused on visual question answering (VQA), where a question must be answered using information from an image.

The landscape of multimodal datasets is large and ever-growing, with benchmarks focusing on different perception and reasoning capabilities, data sources, and applications. The list of datasets here is by no means exhaustive. We briefly describe the key features of ten multimodal datasets and benchmarks and outline a few key trends in the space.

Multimodal Datasets

TextVQA

There are different types of vision-language tasks that a generalist multimodal language model could be evaluated on. One such task is optical character recognition (OCR) combined with answering questions based on text present in an image. One dataset evaluating this type of ability is TextVQA, released in 2019 by Singh et al.

Two examples from TextVQA (Singh et al., 2019)

As the dataset is focused on text present in images, many of the images show things like billboards, whiteboards, or traffic signs. In total, there are 28,408 images from the OpenImages dataset and 45,336 questions associated with them, which require reading and reasoning about text in the images. For each question, annotators provided 10 ground truth answers.
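With ten reference answers per question, a prediction is typically scored by how many annotators it agrees with rather than by a single exact match. The sketch below uses the soft-accuracy rule popularized by the original VQA benchmark, min(matches / 3, 1). We assume here, rather than confirm from the TextVQA paper, that this rule applies unchanged, and official scorers perform more elaborate answer normalization. The function names and the sample annotations are hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and trim so that trivially different strings still match."""
    return answer.strip().lower()


def soft_vqa_accuracy(prediction: str, reference_answers: list[str]) -> float:
    """Return min(matches / 3, 1): full credit once three annotators agree."""
    counts = Counter(normalize(a) for a in reference_answers)
    matches = counts[normalize(prediction)]
    return min(matches / 3.0, 1.0)


# Made-up annotations for one question (not taken from TextVQA).
references = ["stop"] * 6 + ["stop sign"] * 2 + ["halt", "sign"]

print(soft_vqa_accuracy("STOP", references))       # six annotators agree: full credit
print(soft_vqa_accuracy("stop sign", references))  # two annotators agree: partial credit
print(soft_vqa_accuracy("yield", references))      # nobody agrees: no credit
```

The partial credit is what makes ten annotations useful: an answer that only a minority of annotators gave is neither fully right nor fully wrong.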

DocVQA

Like TextVQA, DocVQA deals with reasoning based on text in an image, but it is more specialized: the images in DocVQA are of documents, which contain elements such as tables, forms, and lists, and come from sources such as the chemical and fossil fuel industries. There are 12,767 images from 6,071 documents and 50,000 questions associated with these images. The authors also provide a random split of the data into train (80%), validation (10%), and test (10%) sets.

Example question-answer pairs from DocVQA (Mathew et al., 2020)
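To make the 80/10/10 split concrete, here is a minimal sketch of a seeded random split. It is not the DocVQA authors' procedure: the function name, the seed, and the choice to split at the level of individual questions (rather than documents or images) are all assumptions made for illustration.

```python
import random


def random_split(items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the items and cut them into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, so no item is dropped
    return train, val, test


# Placeholder question IDs standing in for the 50,000 questions.
question_ids = range(50_000)
train, val, test = random_split(question_ids)
print(len(train), len(val), len(test))
```

In practice one might prefer to split by document, so that pages of the same document never appear in both train and test.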

OCRBench

The two datasets mentioned above are far from the only ones available for OCR-related tasks. If one wishes to perform a comprehensive evaluation of a model, it may be expensive and time-consuming to run evaluation on all testing data available. Because of this, samples of several related datasets are sometimes combined into a single benchmark which is smaller than the combination of all individual datasets, and more diverse than any single source dataset.

For OCR-related tasks, one such dataset is OCRBench by Liu et al. It consists of 1,000 manually verified question-answer pairs from 18 datasets (including TextVQA and DocVQA described above). Five main tasks are covered by the benchmark: text recognition, scene text-centric VQA, document-oriented VQA, key information extraction, and handwritten mathematical expression recognition.

Examples of text recognition (a), handwritten mathematical expression recognition (b), and scene text-centric VQA (c) tasks in OCRBench (Liu et al., 2023)
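The idea of a compact composite benchmark can be sketched as drawing a small sample from each source dataset and tagging every example with its origin. This is only an illustration of the sampling step: the dataset names, sizes, and the per-source quota below are made up, and OCRBench additionally relies on manual verification, which code cannot replace.

```python
import random


def build_composite(sources, per_source, seed=0):
    """Draw up to `per_source` examples from each source dataset."""
    rng = random.Random(seed)
    benchmark = []
    for name, examples in sorted(sources.items()):
        k = min(per_source, len(examples))  # small sources contribute all they have
        for example in rng.sample(examples, k):
            benchmark.append({"source": name, **example})
    return benchmark


# Toy stand-ins for real datasets; names and sizes are hypothetical.
sources = {
    "scene_text_vqa": [{"id": i} for i in range(500)],
    "document_vqa": [{"id": i} for i in range(300)],
    "handwritten_math": [{"id": i} for i in range(40)],
}
subset = build_composite(sources, per_source=50)
print(len(subset))
```

Keeping the source name on each example makes it possible to report scores per task as well as overall.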

MathVista

There also exist compilations of multiple datasets for other specialized sets of tasks. For example, MathVista by Lu et al. is focused on mathematical reasoning. It includes 6,141 examples coming from 31 multimodal datasets which involve mathematical tasks (28 previously existing datasets and 3 newly created ones).

Examples from datasets annotated for MathVista (Lu et al., 2023)

The dataset is partitioned into two splits: testmini (1,000 examples) for evaluation with limited resources, and test (the remaining 5,141 examples). To combat model overfitting, answers for the test split are not publicly released.

LogicVista

Another relatively specialized capability that can be evaluated in multimodal LLMs is logical reasoning. One dataset that is intended to do this is the very recently released LogicVista by Xiao et al. It contains 448 multiple-choice questions covering 5 logical reasoning tasks and 9 capabilities. These examples are collected from licensed intelligence test sources and annotated. Two examples from the dataset are shown in the image below.

Examples from the LogicVista dataset (Xiao et al., 2024)

RealWorldQA

As opposed to narrowly defined tasks such as ones involving OCR or mathematics, some datasets cover broader and less restricted objectives and domains. For instance, RealWorldQA is a dataset of over 700 images from the real world, with a question for each image. Although most images come from vehicles and depict driving situations, some show more general scenes with multiple objects in them. Questions are of different types: some have multiple choice options, while others are open, with included instructions like “Please answer directly with a single word or number”.

Example image, question, and answer combinations from RealWorldQA

MMBench

When models compete for the best scores on fixed benchmarks, overfitting to those benchmarks becomes a concern. An overfitted model shows very good results on a particular dataset, but this strong performance does not generalize well to other data. To counter this, there is a recent trend of publicly releasing only the questions of a benchmark, but not the answers. For example, the MMBench dataset is split into dev and test subsets, and while dev is released together with answers, test is not. This dataset consists of 3,217 multiple-choice image-based questions covering 20 fine-grained abilities, which the authors group into the coarse categories of perception (e.g. object localization, image quality) and reasoning (e.g. future prediction, social relation).

Results of eight vision-language models on the 20 abilities defined in MMBench-test, as tested by Liu et al. (2023)

An interesting feature of the dataset is that, in contrast to most other datasets where all questions are in English, MMBench is bilingual, with English questions additionally translated into Chinese (the translations are done automatically using GPT-4 and then verified).

To verify the consistency of the models’ performance and reduce the chance of a model answering correctly by accident, the authors of MMBench ask the models the same question several times, with the order of the multiple-choice options shuffled.
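A minimal sketch of such a consistency check follows. As an assumption, it reorders the options by circular shifts, so that each option appears once in every position, and counts a question as solved only if the model is right under every ordering. The exact procedure used by MMBench may differ, and `ask_model` is a hypothetical callable that returns an option letter.

```python
from typing import Callable, Sequence

LETTERS = "ABCD"


def rotations(options: Sequence[str]):
    """Yield every circular shift of the option list."""
    for shift in range(len(options)):
        yield list(options[shift:]) + list(options[:shift])


def consistently_correct(
    ask_model: Callable[[str, list[str]], str],
    question: str,
    options: Sequence[str],
    correct_option: str,
) -> bool:
    """True only if the model picks the correct option under every ordering."""
    for ordering in rotations(options):
        letter_to_index = {l: i for i, l in enumerate(LETTERS[:len(ordering)])}
        index = letter_to_index.get(ask_model(question, ordering))
        if index is None or ordering[index] != correct_option:
            return False  # wrong or unparseable answer for this ordering
    return True


# Two toy "models": one always answers "A", the other finds the right option.
def always_a(question, ordering):
    return "A"


def knows_answer(question, ordering):
    return LETTERS[ordering.index("cat")]


options = ["dog", "cat", "bird", "fish"]
print(consistently_correct(always_a, "Which animal is shown?", options, "cat"))
print(consistently_correct(knows_answer, "Which animal is shown?", options, "cat"))
```

A model with a position bias, like `always_a`, would get one ordering in four right by luck under single-pass scoring, but fails the stricter check.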

MME

Another benchmark for comprehensive evaluation of multimodal abilities is MME by Fu et al. This dataset covers 14 subtasks related to perception and cognition abilities. Some images in MME come from existing datasets, and some are new, taken manually by the authors. MME differs from most datasets described here in the way its questions are posed: all questions require a “yes” or “no” answer. To better evaluate the models, two questions are designed for each image, such that the answer to one of them is “yes” and to the other “no”, and a model is required to answer both correctly to get a “point” for the task. This dataset is intended only for academic research purposes.

Examples from the MME benchmark (Fu et al., 2023)
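The paired design can be illustrated with a small scoring sketch that awards a point for an image only when both of its questions are answered correctly. This follows the description above rather than the official MME scoring script; the function name and the records are hypothetical.

```python
from collections import defaultdict


def paired_accuracy(records):
    """Fraction of images for which every question was answered correctly.

    `records` is an iterable of (image_id, predicted, expected) tuples,
    where `expected` is the string "yes" or "no".
    """
    per_image = defaultdict(list)
    for image_id, predicted, expected in records:
        per_image[image_id].append(predicted.strip().lower() == expected)
    if not per_image:
        return 0.0
    return sum(all(results) for results in per_image.values()) / len(per_image)


# Made-up predictions: the model is right on both questions for img1,
# but answers "yes" to both questions for img2.
records = [
    ("img1", "yes", "yes"),
    ("img1", "no", "no"),
    ("img2", "yes", "yes"),
    ("img2", "yes", "no"),
]
print(paired_accuracy(records))  # 0.5
```

Here three of the four individual answers are correct, yet only one of the two images earns a point, which is how the pairing penalizes a model that tends to say "yes" to everything.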

MMMU

While most datasets described above evaluate multimodal models on tasks most humans could perform, some datasets focus on specialized expert knowledge instead. One such benchmark is MMMU by Yue et al.

Questions in MMMU require college-level subject knowledge and cover 6 main disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. In total, there are over 11,000 questions from college textbooks, quizzes, and exams. Image types include diagrams, maps, chemical structures, etc.

MMMU examples from two disciplines (Yue et al., 2023)

TVQA

The benchmarks mentioned so far incorporate two data modalities: text and images. While this combination is the most widespread, more modalities, such as video or speech, are being incorporated into large multimodal models. One example of a multimodal dataset that includes video is TVQA by Lei et al., created in 2018. In this dataset, a few questions are asked about each of the 60-90 second video clips taken from six popular TV shows. For some questions, using only the subtitles or only the video is enough, while others require both modalities.

Examples from TVQA (Lei et al., 2018)

Multimodal Inputs on Clarifai

With the Clarifai platform, you can easily process multimodal inputs. In this example notebook, you can see how the Gemini Pro Vision model can be used to answer an image-based question from the RealWorldQA benchmark.

Key Trends in Multimodal Evaluation Benchmarks

We have noticed a few trends related to multimodal benchmarks:

  • In the era of smaller models specialized for a particular task, a dataset would typically include both training and test data (e.g. TextVQA). With the increased popularity of generalist models pre-trained on vast amounts of data, we see more and more datasets intended solely for model evaluation.
  • As the number of available datasets grows, and the models become increasingly larger and more resource-intensive to evaluate, there is a trend of creating curated collections of samples from several datasets for smaller-scale but more comprehensive evaluation.
  • For some datasets, the answers, or in some cases even the questions, are not publicly released. This is intended to combat overfitting of models to specific benchmarks, where good scores on a benchmark do not necessarily indicate generally strong performance.

Conclusion

In this blog post, we briefly described a few datasets that can be used to evaluate multimodal abilities of vision-language models. It should be noted that many other existing benchmarks were not mentioned here. The variety of benchmarks is generally very broad: some datasets focus on a narrow task, such as OCR or math, while others aim to be more comprehensive and reflect the real world; some require general and some highly specialized knowledge; the questions may require a yes/no, a multiple choice, or an open answer.
