arXiv 2305.06988

Self-Chained Image-Language Model for Video Localization and Question Answering

By Shoubin Yu, Jaemin Cho, et al.

Published 2023-05-11

Recent studies have shown promising results in using large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware temporal modeling. When only a portion of a video input is relevant to…
