arXiv 2405.21075

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

By Chaoyou Fu, Yuhan Dai, et al.

Published 2024-05-31

In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of thei…
