arXiv 2210.03347

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

By Kenton Lee, Mandar Joshi, et al.

Published 2022-10-07

Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual languag…
