arXiv 2403.15377

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

By Yi Wang, Kunchang Li, et al.

Published 2024-03-22

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieves state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters. At the data level, we prioritize…
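
To make one of the three objectives named in the abstract concrete, here is a minimal sketch of a symmetric cross-modal contrastive loss of the kind commonly used for video-text alignment. It is a generic illustration, not the paper's implementation: the function names, the toy embeddings, and the temperature of 0.07 are all assumptions chosen for the example.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE-style loss (hypothetical helper).

    video_emb[i] and text_emb[i] are assumed to be a matched pair;
    every other pairing in the batch is treated as a negative.
    temperature=0.07 is an illustrative default, not a value from the paper.
    """
    v = [normalize(x) for x in video_emb]
    t = [normalize(x) for x in text_emb]
    n = len(v)

    # Cosine similarity between every video and every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]

    # Video-to-text direction: each row should put its mass on the diagonal.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-video direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n

    return 0.5 * (v2t + t2v)


# Toy batch: two orthogonal embeddings, paired correctly and then swapped.
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = symmetric_contrastive_loss(aligned, aligned)
loss_swapped = symmetric_contrastive_loss(aligned, aligned[::-1])
```

In this toy batch the correctly paired embeddings give a loss near zero, while swapping the texts gives a much larger loss, which is the behavior the objective is meant to reward.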
