HumanEval-V

Benchmarking High-Level Visual Reasoning with
Complex Diagrams in Coding Tasks

Fengji Zhang*†,1, Linquan Wu*1, Huiyu Bai*1, Guancheng Lin*2,
Xiao Li3, Xiao Yu4, Yue Wang5, Bei Chen5, Jacky Keung1

1CityU Hong Kong, 2Wuhan University, 3Tsinghua University, 4Zhejiang University, 5Rhymes AI

*Core Contributors
†Corresponding to: fengji.zhang@my.cityu.edu.hk
HumanEval-V Coding Task

An example task in HumanEval-V. Each task involves completing a Python function based on
a single diagram, the function signature, and simple instructions provided in the comment block.
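
To make this format concrete, here is a hypothetical stub written in the same style; the function name, the diagram it refers to, and the instructions are illustrative and are not an actual HumanEval-V item:

def rotate_pattern(grid: list[list[int]]) -> list[list[int]]:
    # [Hypothetical task stub in the HumanEval-V style; not from the benchmark.]
    # The accompanying diagram would show how the cells of the input grid map
    # onto the output grid -- here, a 90-degree clockwise rotation.
    # A functionally correct completion, which the LMM is expected to produce:
    return [list(row) for row in zip(*grid[::-1])]

assert rotate_pattern([[1, 2], [3, 4]]) == [[3, 1], [4, 2]]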

👀 Introduction

HumanEval-V is a novel benchmark designed to evaluate the ability of Large Multimodal Models (LMMs) to understand and reason over complex diagrams in programming contexts. Unlike traditional multimodal or coding benchmarks, HumanEval-V challenges models to generate Python code based on visual inputs that are indispensable for solving the task. Our dataset consists of 253 human-annotated coding tasks, each requiring LMMs to perceive, interpret, and reason over diagrams to produce functionally correct code solutions.

Why HumanEval-V?

Despite recent advancements in multimodal reasoning, existing benchmarks focus primarily on scientific or mathematical reasoning, chart-based analysis, or abstract visual reasoning (as in IQ tests), assessing models' domain knowledge or deductive abilities. These benchmarks do not fully challenge models to understand complex diagrams the way humans do.

HumanEval-V addresses this gap by introducing coding tasks where the diagram alone encodes most of the problem context. Models must perform advanced visual reasoning without relying on lengthy textual descriptions, pushing the boundaries of vision reasoning capabilities.

Key Features
Task types in HumanEval-V and the capability aspects required for understanding its diagrams
  • Indispensable visual context: Each task includes a self-contained diagram, eliminating reliance on detailed textual descriptions.
  • Diverse and realistic problem types: The dataset spans six distinct categories, covering a wide range of visual reasoning abilities.
  • Code generation task: Unlike many multimodal benchmarks that rely on multiple-choice or short-answer questions, HumanEval-V requires models to generate executable code, enabling a more rigorous evaluation of diagram comprehension.
  • Structured evaluation pipeline: We introduce a two-stage evaluation approach in which LMMs only need to generate a structured diagram description, which is then translated into code by a separate, strong coder model. This ensures that visual understanding is assessed explicitly rather than conflated with coding proficiency.
  • Execution-based evaluation: Solutions are tested against handcrafted test cases and scored with the pass@k metric, providing an objective measure of correctness (a minimal sketch of such a harness follows this list).
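
As noted in the last bullet above, the following is a minimal sketch of how execution-based scoring could be wired up; the exec-based harness, the candidate solution, and the test strings are illustrative assumptions, and a production harness would additionally sandbox execution and enforce timeouts:

def passes_all_tests(solution_code: str, test_code: str) -> bool:
    # Run a generated solution against handcrafted, assert-style test cases.
    # Returns True only if the solution defines the required function and
    # every assertion passes without raising an exception.
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate function
        exec(test_code, namespace)      # run the handcrafted test cases
        return True
    except Exception:
        return False

# Toy usage with a hypothetical task and its tests:
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"
print(passes_all_tests(candidate, tests))  # True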
Challenges for LMMs
  • Top-performing models struggle, with Claude 3.5 Sonnet achieving only 36.8% pass@1, while Pixtral 124B reaches 21.3% pass@1.
  • LMMs perform better at diagram description than direct code generation, revealing a gap in their vision-to-code capabilities.
  • Sampling and iterative refinement improve results, with Claude 3.5 Sonnet reaching 74.3% pass@1 with 100 samples and 55.3% pass@1 with four self-refinement iterations.
  • Models struggle with tasks trivial for humans, especially in spatial transformations, topological relationships, and dynamic patterns.

🛠 Benchmark Construction

HumanEval-V Construction

We construct HumanEval-V following a collect-distill-recreate-diversify pipeline. After construction, each coding task undergoes rigorous validation to ensure it meets our quality standards.


📁 Diagram Examples

HumanEval-V Examples

HumanEval-V includes visual elements like trees, graphs, matrices, maps, grids, flowcharts, and more. The visual contexts are designed to be indispensable and self-explanatory, embedding rich contextual information and algorithmic patterns.


⚙ Evaluation Setting

HumanEval-V Evaluation Pipelines

We use a structured evaluation pipeline that decouples visual reasoning from coding proficiency,
so that each capability can be assessed on its own.
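
The sketch below outlines this decoupled, two-stage flow; the callables describe_diagram, generate_code, and run_tests are hypothetical stand-ins for the evaluated LMM, the separate coder model, and the execution harness, and the real pipeline's prompts and interfaces may differ:

from typing import Callable

def evaluate_two_stage(
    diagram_path: str,
    signature: str,
    tests: str,
    describe_diagram: Callable[[str, str], str],  # LMM under evaluation
    generate_code: Callable[[str, str], str],     # separate strong coder model
    run_tests: Callable[[str, str], bool],        # execution-based checker
) -> bool:
    # Stage 1: the LMM only has to produce a structured diagram description.
    description = describe_diagram(diagram_path, signature)
    # Stage 2: a text-only coder model translates the description into code,
    # so coding proficiency does not confound the visual-reasoning score.
    code = generate_code(description, signature)
    # Finally, the generated code is scored against handcrafted test cases.
    return run_tests(code, tests)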

🏆Leaderboard🏆

The leaderboard covers both open-weight and proprietary models.

The best performance is shown in bold, while the second-best is indicated by underlining. You can sort by pass@1 or pass@3 by clicking on the column headers.

Leaderboard columns: Models, Source, and pass@1 / pass@3 under four settings: V2C (vision-to-code), V2C w/ CoT, V2T2C (vision-to-text-to-code), and V2T2C w/ GPT-4o.

  • pass@1 is computed using greedy decoding, whereas pass@3 is estimated from 6 generated samples with sampling parameters temperature = 0.8, top-k = 20, and top-p = 0.95 (see the estimator sketch below).
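
For reference, pass@k values like these are commonly computed with the standard unbiased estimator introduced with the original HumanEval benchmark; the sketch below assumes that estimator is used for pass@3 over the 6 samples, which is a conventional but here unconfirmed choice:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: probability that at least one of k solutions
    # drawn without replacement from n samples (c of them correct) passes.
    # pass@k = 1 - C(n - c, k) / C(n, k)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 6 samples for a task, 2 of which pass the handcrafted tests.
print(round(pass_at_k(n=6, c=2, k=3), 3))  # 0.8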

BibTeX


@article{zhang2024humanevalv,
  title={HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks}, 
  author={Zhang, Fengji and Wu, Linquan and Bai, Huiyu and Lin, Guancheng and Li, Xiao and Yu, Xiao and Wang, Yue and Chen, Bei and Keung, Jacky},
  journal={arXiv preprint arXiv:2410.12381},
  year={2024},
}