HumanEval-V

A Lightweight Visual Understanding and Reasoning Benchmark for Evaluating LMMs through Coding Tasks

Fengji Zhang*†,1, Linquan Wu*1, Bai Huiyu*1, Guancheng Lin*2,
Xiao Li3, Xiao Yu4, Yue Wang5, Bei Chen5, Jacky Keung1

1CityU Hong Kong, 2Wuhan University, 3Tsinghua University, 4Zhejiang University, 5Rhymes AI

*Core Contributors
†Correspondence to: fengji.zhang@my.cityu.edu.hk

An example coding task in HumanEval-V. Each task involves completing a Python function based on
a single image, the function signature, and problem descriptions provided in the comment block.

🔔News

  • [2024.11.20] Pixtral-Large-Instruct-2411 achieves a new open-weight SOTA on HumanEval-V with 11.1 pass@1 and 26.9 pass@10!
  • [2024.10.23] Claude 3.5 Sonnet (1022) achieves a new SOTA on HumanEval-V with 25.9 pass@1 and 42.6 pass@10!
  • [2024.10.23] We have added more SOTA LMMs to the leaderboard, including Pixtral, Llama-3.2-Vision, Aria, and Ovis1.6-Gemma2.
  • [2024.10.23] We've updated the Parsing Success Rate metric (passing Pylint without errors). Many LMMs perform poorly, indicating a decline in their ability to generate syntactically correct code.
  • [2024.10.17] Our paper is now accessible at huggingface.co/papers/2410.12381 (#2 Paper of the day)

Introduction

Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the implementation of functional programs -- core capabilities for advancing Artificial General Intelligence. Despite the progress in Large Multimodal Models (LMMs), which extend LLMs with visual perception and understanding capabilities, there remains a notable lack of coding benchmarks that rigorously assess LMMs, particularly on tasks that emphasize visual reasoning. To address this gap, we introduce HumanEval-V, a novel and lightweight benchmark specifically designed to evaluate LMMs' visual understanding and reasoning capabilities through code generation tasks. HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks derived from platforms like CodeForces and Stack Overflow. Each task is adapted by modifying the context and algorithmic patterns of the original problem, with visual elements redrawn so that they differ from the source, preventing potential data leakage. LMMs are required to complete the code solution based on the provided visual context and a predefined Python function signature outlining the task requirements. Every task is equipped with meticulously handcrafted test cases for execution-based pass@k evaluation. We evaluate 20+ state-of-the-art LMMs using HumanEval-V, uncovering significant challenges. Proprietary models like GPT-4o achieve only 13% pass@1 and 36.4% pass@10, while open-weight models with 70B parameters score below 4% pass@1. Ablation studies further demonstrate the limitations of current LMMs in visual reasoning and coding abilities. These results highlight key areas for future research to enhance LMMs' capabilities.
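As a rough illustration of what "execution-based" evaluation means here, the sketch below runs a generated solution against assertion-style test cases and reports whether every assertion holds. This is not the HumanEval-V harness: the function name `passes_tests`, the toy `add` task, and the assertion-string test format are all assumptions made for illustration, and the sketch has no sandboxing or timeout, which a real harness would need before running untrusted model output.

```python
def passes_tests(solution_code: str, test_code: str) -> bool:
    """Return True if the solution runs and every test assertion holds.

    Hypothetical helper for illustration only: it executes code in-process
    with no isolation, so it must not be used on untrusted code as-is.
    """
    namespace = {}
    try:
        exec(solution_code, namespace)  # define the candidate function
        exec(test_code, namespace)      # run the assertions against it
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False
    return True


# Toy task (not from the benchmark): a correct and an incorrect completion.
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"
correct = "def add(a, b):\n    return a + b\n"
wrong = "def add(a, b):\n    return a - b\n"

print(passes_tests(correct, tests))
print(passes_tests(wrong, tests))
```

A sample counts as correct only if all test cases pass; the per-task counts of correct samples are what a pass@k estimate is computed from.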

Benchmark Construction


The construction of HumanEval-V follows a collect-adapt-mutate pipeline. After constructing the benchmark, we perform rigorous validation to ensure that each coding task aligns with the standards.


Visual Context Examples


HumanEval-V includes visual elements like trees, graphs, matrices, maps, grids, flowcharts, and more. The visual contexts are designed to be indispensable and self-explanatory, embedding rich contextual information and algorithmic patterns.

🏆Leaderboard on HumanEval-V🏆

The best performance is shown in bold, while the second-best is indicated by underlining.

Name Source Design Size Pass@1 Pass@10 PSR@1 PSR@10 Date

  • Pass@1 is computed using greedy decoding, whereas Pass@10 is based on 20 generated samples per task, drawn with a sampling temperature of t=0.8 and top-p of p=0.95.
  • PSR represents the percentage of samples that pass Pylint checks without any errors.
  • The Encoder-Decoder design signifies that the model is trained using a vision encoder and a language model decoder.
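To make the Pass@10 setting concrete, the sketch below computes pass@k from n samples per task using the unbiased estimator 1 - C(n-c, k) / C(n, k), where c is the number of samples that pass all tests. That this page's numbers use exactly this estimator is an assumption; it is the formula commonly used for HumanEval-style benchmarks. The function names and the example counts are illustrative only and are not benchmark results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of the n generated samples passed all tests."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over tasks, expressed as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Made-up counts of correct samples (out of n=20) for three hypothetical tasks.
counts = [0, 1, 20]
print(benchmark_pass_at_k(counts, n=20, k=10))
```

With n=20 and k=10, a task with a single correct sample contributes 0.5, since half of all 10-sample subsets contain that sample. Pass@1 under greedy decoding needs no estimator: it is simply the fraction of tasks whose single deterministic output passes all tests.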

BibTeX


@article{zhang2024humanevalv,
  title={HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks}, 
  author={Zhang, Fengji and Wu, Linquan and Bai, Huiyu and Lin, Guancheng and Li, Xiao and Yu, Xiao and Wang, Yue and Chen, Bei and Keung, Jacky},
  journal={arXiv preprint arXiv:2410.12381},
  year={2024},
}