arXiv 2210.03347
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
By Kenton Lee, Mandar Joshi, et al.
Published 2022-10-07
Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual languag…