arXiv 2503.11651

VGGT: Visual Geometry Grounded Transformer

By Jianyuan Wang, Minghao Chen, et al.

Published 2025-03-14

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under…
