E-RayZer: Self-Supervised 3D Reconstruction Learns Explicit Geometry from Unlabeled Images

The challenge of building computers that understand three-dimensional space remains a key goal in artificial intelligence, and recent advances in self-supervised learning offer a promising path forward. Qitao Zhao from Carnegie Mellon University, Hao Tan and Kai Zhang from Adobe Research, along with Qianqian Wang, Sai Bi, and Kalyan Sunkavalli, present E-RayZer, a new method that learns to reconstruct 3D scenes directly from images without requiring labeled data. This approach differs from previous techniques by operating directly in three-dimensional space, creating geometrically accurate representations and avoiding potential shortcuts in learning.
The team demonstrates that E-RayZer not only surpasses existing self-supervised methods, such as RayZer, in tasks like pose estimation and reconstruction quality, but also outperforms leading visual pre-training models when applied to various 3D vision problems, establishing a new standard for 3D-aware computer vision.

Gaussian Splatting for Fast 3D Reconstruction

Recent research demonstrates significant progress in 3D reconstruction and scene understanding, with a focus on techniques like Gaussian Splatting and Neural Radiance Fields. Gaussian Splatting has emerged as a particularly promising method for rapidly creating high-quality 3D models. Researchers are also exploring self-supervised learning approaches, allowing systems to learn 3D representations from unlabeled images, and leveraging large-scale datasets to drive advancements in the field. Diffusion models are increasingly being applied to various 3D vision tasks, and there is growing interest in extending these techniques to video data. These developments are collectively pushing the boundaries of what’s possible in 3D computer vision.

Key areas of investigation include pose estimation, structure from motion, and the creation of large-scale datasets for training and evaluation. Scientists are developing methods to accurately determine the 3D position and orientation of cameras, and to reconstruct the structure of scenes from multiple images. Datasets like ScanNet++, BlendedMVS, and SpatialVid provide valuable resources for training and benchmarking algorithms. Furthermore, masked autoencoders are proving effective for video representation learning, and researchers are exploring techniques for depth estimation and stereo vision.

Explicit 3D Reconstruction via Photometric Self-Supervision

Scientists have developed E-RayZer, a novel self-supervised 3D vision model that learns representations directly from unlabeled images, establishing a new approach to 3D-aware visual pre-training.
Unlike prior methods, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with explicit geometry. This approach grounds representations in accurate scene geometry, yielding genuinely 3D-aware features. To address challenges in training with explicit 3D geometry, the scientists implemented a learning curriculum based on visual overlap between input views. Training commenced with samples exhibiting high visual overlap, allowing the pose estimator to initialize effectively, and gradually progressed to samples with reduced overlap. This curriculum stabilizes training and ensures convergence. The method involves rendering predicted 3D Gaussians and comparing them to input images, driving the model to learn accurate 3D reconstructions.

Experiments demonstrate that E-RayZer significantly outperforms previous methods on pose estimation and matches or surpasses fully supervised reconstruction models, demonstrating its ability to learn robust 3D representations from unlabeled data. The learned representations also outperform leading visual pre-training models when transferred to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.

Unsupervised 3D Vision with Explicit Geometry

Scientists have developed E-RayZer, a new approach to 3D vision that learns representations directly from unlabeled images, achieving a breakthrough in self-supervised learning. Unlike previous methods that indirectly inferred 3D structure, E-RayZer operates directly in 3D space, reconstructing scenes with explicit geometry. This formulation avoids shortcuts and yields geometrically grounded representations, enabling a deeper understanding of 3D environments. The research team introduced a novel learning curriculum, organizing training from easy to hard samples and harmonizing diverse data sources in an entirely unsupervised manner, ensuring both convergence and scalability of the system.
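The two training ingredients described above, a photometric comparison between rendered and input images and an overlap-based easy-to-hard curriculum, can be illustrated with a rough Python sketch. This is not the authors' implementation. The mean-squared-error loss, the linear threshold schedule, the `overlap` score attached to each sample, and all function names are assumptions made purely for illustration, and the Gaussian rendering step itself is left out.

```python
import random


def photometric_loss(rendered, target):
    """Mean squared error between two images given as flat lists of
    pixel intensities in [0, 1]. MSE is an assumed choice of loss."""
    if len(rendered) != len(target):
        raise ValueError("images must have the same number of pixels")
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(rendered)


def min_overlap_at(step, total_steps, start=0.8, end=0.2):
    """Minimum visual overlap allowed at a training step.
    The threshold falls linearly from `start` to `end` (assumed schedule),
    so early steps see only high-overlap, easier samples."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def sample_batch(samples, step, total_steps, batch_size, rng):
    """Draw a batch from the samples whose (hypothetical) overlap score
    is at least the current curriculum threshold."""
    threshold = min_overlap_at(step, total_steps)
    eligible = [s for s in samples if s["overlap"] >= threshold]
    return rng.sample(eligible, min(batch_size, len(eligible)))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy dataset: each sample carries a made-up overlap score in [0, 1].
    samples = [{"id": i, "overlap": i / 10} for i in range(11)]
    for step in (0, 50, 100):
        batch = sample_batch(samples, step, 100, 4, rng)
        print(step, sorted(s["overlap"] for s in batch))
    # Compare a toy "rendered" image against a toy "input" image.
    print(photometric_loss([0.0, 1.0], [0.0, 0.0]))
```

In this sketch the pool of eligible samples grows as training proceeds, which mirrors the idea of starting with high-overlap views and gradually admitting harder, low-overlap ones. How the overlap score is actually computed, and how the threshold is scheduled in E-RayZer, is not specified in this article.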
Experiments demonstrate that E-RayZer significantly outperforms its predecessor, RayZer, on pose estimation tasks, showing a substantially improved ability to determine camera positions and orientations. Furthermore, E-RayZer achieves performance on par with, and sometimes surpassing, fully supervised reconstruction methods, despite being trained without any manual annotations. Tests reveal that E-RayZer shows scaling patterns comparable to those of supervised models, indicating its potential for large-scale applications. The learned representations from E-RayZer also outperform leading visual pre-training methods when applied to downstream 3D tasks, establishing E-RayZer as a strong paradigm for spatial visual pre-training.

Direct 3D Learning From Multi-View Images

E-RayZer represents a significant advance in 3D computer vision, establishing a novel approach to learning geometrically grounded representations from multi-view images. Researchers developed a self-supervised 3D reconstruction model that operates directly in 3D space, unlike previous methods, which inferred 3D information indirectly. This direct approach eliminates potential shortcuts and yields representations grounded in scene geometry, improving on existing unsupervised techniques and achieving results comparable to fully supervised methods. Extensive experiments confirm the effectiveness of E-RayZer, with its learned representations consistently outperforming leading visual pre-training models when applied to various 3D downstream tasks.
The team also introduced a fine-grained learning curriculum, organizing training from easier to more challenging samples and harmonizing diverse data sources without manual tuning. This curriculum demonstrably improves both pose estimation and reconstruction accuracy, while also enhancing scalability.

👉 More information
🗞 E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training
🧠 ArXiv: https://arxiv.org/abs/2512.10950
