arXiv 2503.11651

VGGT: Visual Geometry Grounded Transformer

By Jianyuan Wang, Minghao Chen, et al.

Published 2025-03-14

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under…
