arXiv 2102.05095
Is Space-Time Attention All You Need for Video Understanding?
By Gedas Bertasius, Heng Wang, et al.
Published 2021-02-09
We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal…
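To make the phrase "a sequence of frame-level patches" concrete, here is a minimal sketch of how a clip might be cut into non-overlapping patches per frame before any attention is applied. This is not the authors' code: the function name `video_to_patch_sequence`, the `(T, H, W, C)` layout, and the clip and patch sizes are assumptions chosen for illustration, and the sketch stops short of the attention schemes themselves.

```python
import numpy as np


def video_to_patch_sequence(clip, patch_size):
    """Split a clip of shape (T, H, W, C) into flattened, non-overlapping patches.

    Hypothetical helper for illustration. Returns an array of shape
    (T, N, patch_size * patch_size * C), where
    N = (H // patch_size) * (W // patch_size) is the number of patches per frame.
    """
    t, h, w, c = clip.shape
    if h % patch_size or w % patch_size:
        raise ValueError("frame height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Expose the patch grid: (T, gh, P, gw, P, C)
    patches = clip.reshape(t, gh, patch_size, gw, patch_size, c)
    # Group the two grid axes together, then the within-patch axes: (T, gh, gw, P, P, C)
    patches = patches.transpose(0, 1, 3, 2, 4, 5)
    # Flatten each patch into one vector: (T, N, P * P * C)
    return patches.reshape(t, gh * gw, patch_size * patch_size * c)


# Assumed, illustrative sizes: 8 frames of 224x224 RGB, 16x16 patches.
clip = np.zeros((8, 224, 224, 3), dtype=np.float32)
tokens = video_to_patch_sequence(clip, patch_size=16)
print(tokens.shape)  # (8, 196, 768)
```

Under these assumed sizes, each frame yields 196 patch vectors of length 768, so the clip as a whole gives 8 × 196 = 1568 vectors. A Transformer-style model would typically project each vector linearly and add positional information before applying self-attention across space and time.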