OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models
🚀 New! Our paper "OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models" has been accepted to CVPR 2026! 🎉
This repository contains the official evaluation code and data for our work.
📚 Read the paper: arXiv PDF | arXiv Page
Introduction
Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision-language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis.
In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, in each of which a single element differs from all others in one or more visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5 and proprietary systems such as Gemini-2.5-Pro and GPT-5, perform far below human level in visual discrepancy detection.
We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and a distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model’s fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence.
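To make the idea of a distance-aware reward concrete, here is a minimal sketch. The exact reward used by OddGrid-GRPO is not specified in this card, so everything below is an assumption for illustration: the function name `distance_aware_reward`, the `(row, col)` cell encoding, the Euclidean distance, and the `partial_scale` parameter are all hypothetical. The sketch gives full reward for the exact odd cell and partial credit that shrinks as the predicted cell moves farther from it.

```python
import math


def distance_aware_reward(pred, target, rows, cols, partial_scale=0.5):
    """Hypothetical distance-aware reward over grid cells (not the paper's formula).

    pred, target: (row, col) tuples, zero-indexed.
    Returns 1.0 for an exact match, 0.0 for an out-of-grid prediction,
    and otherwise partial credit that decays linearly with the normalized
    Euclidean distance between the predicted and the true cell.
    """
    row, col = pred
    if not (0 <= row < rows and 0 <= col < cols):
        return 0.0  # prediction falls outside the grid
    if pred == target:
        return 1.0  # exact localization of the odd element

    # Longest possible distance in the grid: corner to opposite corner.
    max_dist = math.hypot(rows - 1, cols - 1)
    if max_dist == 0:
        return 0.0  # degenerate 1x1 grid with a mismatched target

    dist = math.hypot(row - target[0], col - target[1])
    return partial_scale * (1.0 - dist / max_dist)


if __name__ == "__main__":
    print(distance_aware_reward((2, 3), (2, 3), rows=5, cols=5))  # exact match
    print(distance_aware_reward((2, 4), (2, 3), rows=5, cols=5))  # neighbouring cell
    print(distance_aware_reward((0, 0), (4, 4), rows=5, cols=5))  # opposite corner
```

Capping partial credit below 1.0 keeps a near miss strictly worse than a correct answer while still giving the policy a gradient toward the right region of the grid.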
Dataset Creation
OddGridBench is designed to systematically evaluate the visual discrepancy sensitivity of multimodal large language models in a controlled and interpretable setting. The benchmark consists of over 1,400 grid-based images, in which all elements but one follow a shared visual pattern and a single element deviates from the others. The discrepancy can arise from one or more visual attributes, including color, size, rotation, and position, enabling fine-grained assessment of perceptual sensitivity under varying difficulty levels. Please refer to our Hugging Face 🤗 dataset for more details.
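The grid construction described above can be illustrated with a toy generator. This is not the authors' pipeline: it produces plain Python records instead of images, and the names `BASE_ELEMENT`, `ODD_VALUES`, and `make_odd_grid`, along with all attribute values, are hypothetical choices made for this sketch.

```python
import random

# Shared appearance of every regular element (illustrative values only).
BASE_ELEMENT = {
    "color": (120, 120, 120),   # RGB
    "size": 1.0,                # relative scale
    "rotation": 0.0,            # degrees
    "position": (0.0, 0.0),     # offset from the cell centre
}

# Values the odd element takes for each attribute that is perturbed.
ODD_VALUES = {
    "color": (135, 120, 120),
    "size": 1.1,
    "rotation": 10.0,
    "position": (0.05, 0.0),
}


def make_odd_grid(rows, cols, odd_attributes, seed=0):
    """Build a rows x cols grid where exactly one cell differs from the rest.

    odd_attributes: non-empty list of attribute names to perturb in the odd cell.
    Returns (grid, odd_cell), where grid is a list of lists of element dicts
    and odd_cell is the (row, col) index of the deviating element.
    """
    if not odd_attributes:
        raise ValueError("at least one attribute must differ")
    unknown = [a for a in odd_attributes if a not in ODD_VALUES]
    if unknown:
        raise ValueError(f"unknown attributes: {unknown}")

    rng = random.Random(seed)
    odd_cell = (rng.randrange(rows), rng.randrange(cols))

    # Every cell starts as an independent copy of the shared pattern.
    grid = [[dict(BASE_ELEMENT) for _ in range(cols)] for _ in range(rows)]
    for attribute in odd_attributes:
        grid[odd_cell[0]][odd_cell[1]][attribute] = ODD_VALUES[attribute]
    return grid, odd_cell


if __name__ == "__main__":
    grid, odd_cell = make_odd_grid(4, 5, ["color", "rotation"], seed=1)
    print("odd element at", odd_cell, "->", grid[odd_cell[0]][odd_cell[1]])
```

In a setup like this, difficulty could be varied by changing how far the odd values sit from the base values or how many attributes are perturbed at once.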
Load Dataset
from datasets import load_dataset
# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("wwwtttjjj/OddGridBench")
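Because this card does not list the split or column names, it can help to inspect whatever `load_dataset` returns before writing evaluation code. The helper below is a small sketch: `describe_splits` is a hypothetical name, and it only relies on the returned object being a mapping from split names to objects that support `len()` and expose `column_names`, as `DatasetDict` and `Dataset` do.

```python
def describe_splits(dataset_dict):
    """Summarize a DatasetDict-like mapping without assuming split or column names.

    Returns {split_name: (num_rows, column_names)}; column_names is None
    if a split object does not expose that attribute.
    """
    summary = {}
    for split_name, split in dataset_dict.items():
        columns = getattr(split, "column_names", None)
        summary[split_name] = (len(split), columns)
    return summary


# Usage with the dataset loaded above (requires network access and the
# `datasets` package):
#
#   from datasets import load_dataset
#   ds = load_dataset("wwwtttjjj/OddGridBench")
#   for name, (num_rows, columns) in describe_splits(ds).items():
#       print(name, num_rows, columns)
```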
Load Model
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "wwwtttjjj/OddGrid-GRPO"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
Citation
BibTeX:
@inproceedings{weng2026oddgridbench,
title={OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models},
author={Weng, Tengjin and Jiang, Wenhao and Wang, Jingyi and Li, Ming and Ma, Lin and Ming, Zhong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}