arXiv 2102.05095

Is Space-Time Attention All You Need for Video Understanding?

By Gedas Bertasius, Heng Wang, et al.

Published 2021-02-09

We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal…
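To make the phrase "a sequence of frame-level patches" concrete, here is a minimal sketch of how a video clip might be cut into non-overlapping square patches and flattened into one sequence, frame by frame. It is an illustration only, not the authors' implementation: the function name `video_to_patch_sequence`, the nested-list video layout `video[frame][row][column]`, and the choice of patch size are all assumptions made for this example.

```python
def video_to_patch_sequence(video, patch_size):
    """Flatten a video into a list of frame-level patches.

    Hypothetical helper for illustration. `video` is assumed to be a
    nested list indexed as video[frame][row][column], where each entry
    is one pixel value. Patches are square, non-overlapping, and listed
    frame by frame, then row-major within each frame.
    """
    height = len(video[0])
    width = len(video[0][0])
    if height % patch_size or width % patch_size:
        raise ValueError("frame size must be divisible by patch_size")

    sequence = []
    for frame in video:
        # Walk the top-left corner of each patch across the frame.
        for top in range(0, height, patch_size):
            for left in range(0, width, patch_size):
                # Flatten one patch into a single list of pixel values.
                patch = [
                    frame[y][x]
                    for y in range(top, top + patch_size)
                    for x in range(left, left + patch_size)
                ]
                sequence.append(patch)
    return sequence


# Toy clip: 2 frames of 4x4 pixels, each pixel encoding its own position.
toy_video = [
    [[f * 100 + y * 10 + x for x in range(4)] for y in range(4)]
    for f in range(2)
]
patches = video_to_patch_sequence(toy_video, patch_size=2)
print(len(patches))  # number of patches across all frames
```

Under these assumptions, a clip with T frames of H x W pixels yields T * (H / P) * (W / P) patches for patch size P. In a Transformer such as the one described above, each flattened patch would then typically be mapped to an embedding before self-attention is applied; that step is omitted here.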