arXiv 2210.03347

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

By Kenton Lee, Mandar Joshi, et al.

Published 2022-10-07

Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual languag…
