HumanEval-V

Benchmarking High-Level Visual Reasoning with
Complex Diagrams in Coding Tasks

Fengji Zhang*†,1, Linquan Wu*1, Huiyu Bai*1, Guancheng Lin*2,
Xiao Li3, Xiao Yu4, Yue Wang5, Bei Chen5, Jacky Keung1

1CityU Hong Kong, 2Wuhan University, 3Tsinghua University, 4Zhejiang University, 5Rhymes AI

*Core Contributors
†Corresponding to: fengji.zhang@my.cityu.edu.hk
HumanEval-V Coding Task

An example task in HumanEval-V. Each task involves completing a Python function based on
a single diagram, the function signature, and simple instructions provided in the comment block.
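
To make this format concrete, here is a hypothetical stub written in the same style; the function name, the diagram it refers to, and the instructions are illustrative and are not an actual HumanEval-V item:

def rotate_pattern(grid: list[list[int]]) -> list[list[int]]:
    # [Hypothetical task stub in the HumanEval-V style; not from the benchmark.]
    # The accompanying diagram would show how the cells of the input grid map
    # onto the output grid -- here, a 90-degree clockwise rotation.
    # A functionally correct completion, which the LMM is expected to produce:
    return [list(row) for row in zip(*grid[::-1])]

assert rotate_pattern([[1, 2], [3, 4]]) == [[3, 1], [4, 2]]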

👀 Introduction

HumanEval-V is a novel benchmark designed to evaluate the ability of Large Multimodal Models (LMMs) to understand and reason over complex diagrams in programming contexts. Unlike traditional multimodal or coding benchmarks, HumanEval-V challenges models to generate Python code based on visual inputs that are indispensable for solving the task. Our dataset consists of 253 human-annotated coding tasks, each requiring LMMs to perceive, interpret, and reason over diagrams to produce functionally correct code solutions.

Why HumanEval-V?

Despite recent advancements in multimodal reasoning, existing benchmarks focus primarily on scientific or mathematical reasoning, chart-based analysis, or abstract visual reasoning (as in IQ tests), assessing models' domain knowledge or deductive abilities. These benchmarks do not fully challenge models to understand complex diagrams the way humans do.

HumanEval-V addresses this gap by introducing coding tasks where the diagram alone encodes most of the problem context. Models must perform advanced visual reasoning without relying on lengthy textual descriptions, pushing the boundaries of vision reasoning capabilities.

Key Features
Task types in HumanEval-V and the capability aspects required for understanding its diagrams
  • Indispensable visual context: Each task includes a self-contained diagram, eliminating reliance on detailed textual descriptions.
  • Diverse and realistic problem types: The dataset spans six distinct categories, covering a wide range of visual reasoning abilities.
  • Code generation task: Unlike many multimodal benchmarks that rely on multiple-choice or short-answer questions, HumanEval-V requires models to generate executable code, enabling a more rigorous evaluation of diagram comprehension.
  • Structured evaluation pipeline: We introduce a two-stage evaluation approach in which LMMs only need to generate a structured diagram description, which is then translated into code by a separate, strong coder model. This ensures that visual understanding is assessed explicitly rather than conflated with coding proficiency.
  • Execution-based evaluation: Solutions are tested against handcrafted test cases and scored with the pass@k metric, providing an objective measure of correctness (a minimal sketch of such a harness follows this list).
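
As noted in the last bullet above, the following is a minimal sketch of how execution-based scoring could be wired up; the exec-based harness, the candidate solution, and the test strings are illustrative assumptions, and a production harness would additionally sandbox execution and enforce timeouts:

def passes_all_tests(solution_code: str, test_code: str) -> bool:
    # Run a generated solution against handcrafted, assert-style test cases.
    # Returns True only if the solution defines the required function and
    # every assertion passes without raising an exception.
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate function
        exec(test_code, namespace)      # run the handcrafted test cases
        return True
    except Exception:
        return False

# Toy usage with a hypothetical task and its tests:
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"
print(passes_all_tests(candidate, tests))  # True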
Challenges for LMMs
  • Top-performing models struggle, with Claude 3.5 Sonnet achieving only 36.8% pass@1, while Pixtral 124B reaches 21.3% pass@1.
  • LMMs perform better at diagram description than direct code generation, revealing a gap in their vision-to-code capabilities.
  • Sampling and iterative refinement improve results, with Claude 3.5 Sonnet reaching 74.3% pass@1 with 100 samples and 55.3% pass@1 with four self-refinement iterations.
  • Models struggle with tasks trivial for humans, especially in spatial transformations, topological relationships, and dynamic patterns.

🛠 Benchmark Construction

HumanEval-V Construction

We construct HumanEval-V following a collect-distill-recreate-diversify pipeline. After construction, each coding task undergoes rigorous validation to ensure it meets our quality standards.


📁 Diagram Examples

HumanEval-V Examples

HumanEval-V includes visual elements like trees, graphs, matrices, maps, grids, flowcharts, and more. The visual contexts are designed to be indispensable and self-explanatory, embedding rich contextual information and algorithmic patterns.


⚙ Evaluation Setting

HumanEval-V Evaluation Pipelines

We use a structured evaluation pipeline that decouples visual reasoning from coding proficiency,
so that each capability can be assessed on its own.
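
The sketch below outlines this decoupled, two-stage flow; the callables describe_diagram, generate_code, and run_tests are hypothetical stand-ins for the evaluated LMM, the separate coder model, and the execution harness, and the real pipeline's prompts and interfaces may differ:

from typing import Callable

def evaluate_two_stage(
    diagram_path: str,
    signature: str,
    tests: str,
    describe_diagram: Callable[[str, str], str],  # LMM under evaluation
    generate_code: Callable[[str, str], str],     # separate strong coder model
    run_tests: Callable[[str, str], bool],        # execution-based checker
) -> bool:
    # Stage 1: the LMM only has to produce a structured diagram description.
    description = describe_diagram(diagram_path, signature)
    # Stage 2: a text-only coder model translates the description into code,
    # so coding proficiency does not confound the visual-reasoning score.
    code = generate_code(description, signature)
    # Finally, the generated code is scored against handcrafted test cases.
    return run_tests(code, tests)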

🏆Leaderboard🏆

The leaderboard covers both open-weight and proprietary models.

The best performance is shown in bold, while the second-best is indicated by underlining. You can sort by pass@1 or pass@3 by clicking on the column headers.

Leaderboard columns: Models, Source, and pass@1 / pass@3 under four settings: V2C (vision-to-code), V2C w/ CoT, V2T2C (vision-to-text-to-code), and V2T2C w/ GPT-4o.

  • pass@1 is computed using greedy decoding, whereas pass@3 is estimated from 6 generated samples with sampling parameters temperature = 0.8, top-k = 20, and top-p = 0.95 (see the estimator sketch below).
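
For reference, pass@k values like these are commonly computed with the standard unbiased estimator introduced with the original HumanEval benchmark; the sketch below assumes that estimator is used for pass@3 over the 6 samples, which is a conventional but here unconfirmed choice:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: probability that at least one of k solutions
    # drawn without replacement from n samples (c of them correct) passes.
    # pass@k = 1 - C(n - c, k) / C(n, k)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 6 samples for a task, 2 of which pass the handcrafted tests.
print(round(pass_at_k(n=6, c=2, k=3), 3))  # 0.8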

BibTeX


@article{zhang2024humanevalv,
  title={HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks}, 
  author={Zhang, Fengji and Wu, Linquan and Bai, Huiyu and Lin, Guancheng and Li, Xiao and Yu, Xiao and Wang, Yue and Chen, Bei and Keung, Jacky},
  journal={arXiv preprint arXiv:2410.12381},
  year={2024},
}