arXiv 2403.15377
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
By Yi Wang, Kunchang Li, et al.
Published 2024-03-22
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters. At the data level, we prioritize…