arXiv:2210.03347

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

By Kenton Lee, Mandar Joshi, et al.

Published 2022-10-07

Wiki summary

Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual languag…
