arXiv 2210.03347

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

By Kenton Lee, Mandar Joshi, et al.

Published 2022-10-07

Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual languag…
