arXiv:2102.05095

Is Space-Time Attention All You Need for Video Understanding?

By Gedas Bertasius, Heng Wang, et al.

Published 2021-02-09

We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal…
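To make "a sequence of frame-level patches" concrete, the following is a minimal sketch of how a clip might be turned into per-frame patch tokens before any attention is applied. It is not the authors' implementation: the function name `video_to_patch_tokens`, the `(T, H, W, C)` layout, the non-overlapping grid, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np


def video_to_patch_tokens(clip, patch_size):
    """Split a clip of shape (T, H, W, C) into flattened, non-overlapping patches.

    Returns an array of shape (T, N, patch_size * patch_size * C), where
    N = (H // patch_size) * (W // patch_size) is the number of patches per frame.
    Hypothetical helper for illustration only.
    """
    t, h, w, c = clip.shape
    if h % patch_size or w % patch_size:
        raise ValueError("frame height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Carve each frame into a (gh, gw) grid of (P, P, C) blocks.
    patches = clip.reshape(t, gh, patch_size, gw, patch_size, c)
    # Bring the two grid axes together, ahead of the within-patch axes.
    patches = patches.transpose(0, 1, 3, 2, 4, 5)
    # Flatten every patch into one vector, giving one token per patch per frame.
    return patches.reshape(t, gh * gw, patch_size * patch_size * c)


# Toy clip: 2 frames of 4x4 RGB pixels, split into 2x2 patches (assumed sizes).
clip = np.arange(2 * 4 * 4 * 3, dtype=np.float32).reshape(2, 4, 4, 3)
tokens = video_to_patch_tokens(clip, patch_size=2)
print(tokens.shape)
```

Keeping the frame axis separate from the patch axis in the output makes it straightforward to attend along either axis, or to merge both into a single token sequence, depending on the self-attention scheme being compared.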
