
Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images

Xiangyu Sun1*, Haoyi Jiang2*, Liu Liu4, Seungtae Nam3, Gyeongjin Kang1, Xinjie Wang4,

Wei Sui5, Zhizhong Su4, Wenyu Liu2, Xinggang Wang2, Eunbyung Park3
1Sungkyunkwan University, 2Huazhong University of Science & Technology, 3Yonsei University, 4Horizon Robotics, 5D-Robotics
*Equal Contribution

Uni3R takes unposed multi-view images and produces a unified 3D Gaussian scene representation, enabling view synthesis, semantic segmentation, and depth estimation within a single forward pass. The radar chart compares Uni3R with existing methods, showing consistently competitive or superior performance.

Abstract

Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed-forward framework that jointly reconstructs a unified 3D scene representation enriched with open-vocabulary semantics, directly from unposed multi-view images. Our approach leverages a Cross-View Transformer to robustly integrate information across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction—all within a single, feed-forward pass. Extensive experiments demonstrate that Uni3R establishes a new state-of-the-art across multiple benchmarks, including RE10K and ScanNet datasets. Our work signifies a novel paradigm towards generalizable, unified 3D scene reconstruction and understanding.

Framework

Framework Overview

Architectural overview of the Uni3R pipeline for multi-task 3D reconstruction and scene understanding. Uni3R predicts a set of Gaussian primitives with jointly integrated geometry, appearance, and open-vocabulary semantics in a single pass, eliminating the need for per-scene optimization. Its cross-frame attention mechanism enables robust feature fusion to produce globally consistent scene representations from an arbitrary number of input views, while its predicted point maps provide potent geometric guidance.
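The following is a minimal PyTorch-style sketch of the kind of feed-forward pipeline described above: per-view features are fused by cross-frame attention and then regressed into per-pixel Gaussian parameters plus a semantic feature field. Module names (CrossViewTransformer-style encoder, Gaussian and semantic heads), tensor shapes, and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; architecture details are assumptions, not Uni3R's exact design.
import torch
import torch.nn as nn

class Uni3RSketch(nn.Module):
    def __init__(self, feat_dim=256, sem_dim=512, gauss_params=14):
        super().__init__()
        # Per-view patch encoder standing in for a ViT-like backbone (hypothetical).
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        # Cross-view transformer: attention runs over tokens from all input views jointly.
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.cross_view = nn.TransformerEncoder(layer, num_layers=6)
        # Heads regressing per-pixel Gaussian parameters
        # (center/point map, scale, rotation, opacity, color) and semantic features.
        self.gaussian_head = nn.Linear(feat_dim, gauss_params)
        self.semantic_head = nn.Linear(feat_dim, sem_dim)

    def forward(self, images):                 # images: (V, 3, H, W), unposed views
        tokens = self.encoder(images)          # (V, C, H/16, W/16)
        V, C, h, w = tokens.shape
        tokens = tokens.flatten(2).permute(0, 2, 1).reshape(1, V * h * w, C)
        fused = self.cross_view(tokens)        # cross-frame attention over all views
        gaussians = self.gaussian_head(fused)  # (1, V*h*w, 14) Gaussian primitives
        semantics = self.semantic_head(fused)  # (1, V*h*w, 512) semantic feature field
        return gaussians, semantics
```

A single forward pass over an arbitrary number of views yields the Gaussian primitives that are then rasterized for novel view synthesis, depth rendering, and semantic segmentation.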

Results

Novel-View Segmentation

Visualization

Quantitative Comparison on ScanNet.


We evaluate performance on novel view synthesis, depth estimation, and open-vocabulary semantic segmentation. Unlike LSM, Uni3R is trained without ground-truth 3D point clouds. A sketch of how open-vocabulary segmentation can be scored is shown below.
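As a hedged illustration of the open-vocabulary segmentation step, the snippet below matches a rendered semantic feature map against per-class text embeddings (e.g., from a CLIP-style text encoder) by cosine similarity and takes the per-pixel argmax. The function name, shapes, and this exact matching rule are assumptions for exposition, not the paper's precise evaluation protocol.

```python
# Hedged sketch: open-vocabulary segmentation from a rendered semantic feature map.
import torch
import torch.nn.functional as F

def open_vocab_segmentation(feature_map, text_embeddings):
    """feature_map: (D, H, W) rendered semantic features;
    text_embeddings: (K, D) one embedding per class prompt."""
    D, H, W = feature_map.shape
    feats = F.normalize(feature_map.flatten(1).t(), dim=-1)  # (H*W, D) unit-norm pixel features
    texts = F.normalize(text_embeddings, dim=-1)             # (K, D) unit-norm class prompts
    logits = feats @ texts.t()                                # (H*W, K) cosine similarities
    return logits.argmax(dim=-1).reshape(H, W)                # per-pixel class indices
```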

ScanNet Multi-View Inputs (2, 8, and 16 views)

Quantitative Comparison on RE10K.


Comparison with 4 and 8 views on the RE10K dataset and zero-shot performance on the ScanNet dataset.

RE10K Multi-View Inputs (4 and 8 views)

BibTeX

@misc{sun2025uni3runified3dreconstruction,
      title={Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images}, 
      author={Xiangyu Sun and Haoyi Jiang and Liu Liu and Seungtae Nam and Gyeongjin Kang and Xinjie Wang and Wei Sui and Zhizhong Su and Wenyu Liu and Xinggang Wang and Eunbyung Park},
      year={2025},
      eprint={2508.03643},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.03643}, 
}