arXiv 2203.12602

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

By Zhan Tong, Yibing Song, et al.

Published 2022-03-23

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design make…
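The tube masking mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `tube_mask`, the 8 x 14 x 14 token grid, and the 0.9 ratio are assumptions chosen for illustration, since the abstract above only says the ratio is "extremely high". The idea shown is that one random spatial mask is sampled and then repeated across every temporal slice, so each masked patch position forms a "tube" through time.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=None):
    """Return a num_frames x (grid_h * grid_w) list of booleans (True = masked).

    Hypothetical helper for illustration only; names and defaults are assumed.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))

    # Sample which spatial patch positions are hidden (done once, not per frame).
    masked_positions = set(rng.sample(range(num_patches), num_masked))
    spatial_mask = [i in masked_positions for i in range(num_patches)]

    # Repeat the same spatial mask for every temporal slice, forming tubes.
    return [list(spatial_mask) for _ in range(num_frames)]


# Illustrative sizes (assumed): 8 temporal slices over a 14 x 14 patch grid.
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9, seed=0)
visible_per_slice = mask[0].count(False)
```

Because every temporal slice shares the same mask, a patch that is hidden in one frame is hidden in all of them, so the model cannot recover it simply by copying the same location from a neighboring frame.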
