SACReg: Scene-Agnostic Coordinate Regression for Visual Localization

Jérôme Revaud, Yohann Cabon, Romain Brégier, JongMin Lee, Philippe Weinzaepfel



Scene coordinate regression (SCR), i.e., predicting 3D coordinates for every pixel of a given image, has recently shown promising potential. However, existing methods remain mostly scene-specific or limited to small scenes, and thus hardly scale to realistic datasets. In this paper, we propose a new paradigm in which a single generic SCR model is trained once and then deployed to new test scenes, regardless of their scale and without further finetuning. For a given query image, the model collects inputs from off-the-shelf image retrieval techniques and Structure-from-Motion databases: a list of relevant database images with sparse pointwise 2D-3D annotations. The model is based on the transformer architecture and can take a variable number of images and sparse 2D-3D annotations as input. It is trained on a few diverse datasets and, for visual localization, significantly outperforms other scene regression approaches, including scene-specific models, on several benchmarks. In particular, we set a new state of the art on the Cambridge localization benchmark, even outperforming feature-matching-based approaches.

Method overview

Given a query image and a set of related views with sparse 2D-3D annotations retrieved from a database, SACReg predicts absolute 3D coordinates for each pixel of the query image. These predictions can be used for visual localization by estimating the camera pose with a robust PnP algorithm. Importantly, SACReg is scene-agnostic: it does not need any retraining for new datasets; only the images and 2D-3D annotations that serve as input are scene-specific.
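
To make the localization step concrete, here is a minimal NumPy sketch of how a dense coordinate map and its confidence map could be turned into a camera pose. It is an illustration under stated assumptions, not the solver used by SACReg: the function names (`dense_correspondences`, `dlt_pose`, `ransac_pnp`), the confidence threshold of 0.5, and the 3-pixel inlier threshold are all hypothetical, and a plain RANSAC loop around a Direct Linear Transform stands in for a production robust PnP solver.

```python
import numpy as np

def dense_correspondences(coords_map, conf_map, conf_thresh=0.5):
    """Turn an (H, W, 3) coordinate map and an (H, W) confidence map into
    2D-3D correspondences, keeping only confident pixels (assumed threshold)."""
    ys, xs = np.nonzero(conf_map > conf_thresh)
    pts2d = np.stack([xs, ys], axis=1).astype(np.float64)  # (x, y) pixel positions
    pts3d = coords_map[ys, xs].astype(np.float64)          # predicted world points
    return pts3d, pts2d

def dlt_pose(pts3d, pts2d_norm):
    """Estimate R, t (world -> camera) from >= 6 correspondences with the DLT.
    pts2d_norm are normalized image coordinates, i.e. K^-1 already applied."""
    X = np.hstack([pts3d, np.ones((len(pts3d), 1))])       # homogeneous 3D points
    u, v = pts2d_norm[:, 0:1], pts2d_norm[:, 1:2]
    zeros = np.zeros_like(X)
    A = np.vstack([np.hstack([X, zeros, -u * X]),
                   np.hstack([zeros, X, -v * X])])
    # The projection matrix is the right singular vector of the smallest singular value.
    P = np.linalg.svd(A, full_matrices=False)[2][-1].reshape(3, 4)
    # P equals s * [R | t] for an unknown scale s: recover a proper rotation and s.
    U, S, Vt = np.linalg.svd(P[:, :3])
    R, scale = U @ Vt, S.mean()
    if np.linalg.det(R) < 0:                               # s was negative
        R, scale = -R, -scale
    return R, P[:, 3] / scale

def reprojection_error(pts3d, pts2d, K, R, t):
    """Pixel distance between projected 3D points and their 2D observations."""
    cam = pts3d @ R.T + t
    proj = cam @ K.T
    err = np.linalg.norm(proj[:, :2] / proj[:, 2:3] - pts2d, axis=1)
    err[~(cam[:, 2] > 0)] = np.inf                         # points behind the camera
    return err

def ransac_pnp(pts3d, pts2d, K, thresh_px=3.0, iters=100, seed=0):
    """RANSAC over 6-point DLT samples, then a refit on the best inlier set."""
    if len(pts3d) < 6:
        raise ValueError("need at least 6 correspondences")
    rng = np.random.default_rng(seed)
    pts2d_h = np.hstack([pts2d, np.ones((len(pts2d), 1))])
    pts2d_norm = (pts2d_h @ np.linalg.inv(K).T)[:, :2]
    best = np.zeros(len(pts3d), dtype=bool)
    with np.errstate(all="ignore"):                        # degenerate samples may yield inf/nan
        for _ in range(iters):
            sample = rng.choice(len(pts3d), size=6, replace=False)
            R, t = dlt_pose(pts3d[sample], pts2d_norm[sample])
            inliers = reprojection_error(pts3d, pts2d, K, R, t) < thresh_px
            if inliers.sum() > best.sum():
                best = inliers
    if best.sum() < 6:
        raise RuntimeError("RANSAC did not find a consistent pose")
    R, t = dlt_pose(pts3d[best], pts2d_norm[best])
    return R, t, best
```

With a predicted `coords_map`, `conf_map`, and known intrinsics `K`, a call would look like `R, t, inliers = ransac_pnp(*dense_correspondences(coords_map, conf_map), K)`. Because the regressed coordinates are absolute, the recovered pose is expressed directly in the scene's world frame.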

Regression examples

Below are regression examples on Aachen-Day, a dataset on which SACReg has not been trained. Our model predicts a dense map of 3D coordinates and a confidence map for a given query image, using reference images retrieved from an SfM database. Only the first 3 reference images (out of 8) are depicted. For visualization purposes, 3D coordinates and confidence are colorized, and low-confidence areas are not displayed.

3D reconstruction

The 3D coordinates predicted by SACReg can be used to lift a query image into a dense, colored 3D point cloud. Because SACReg directly regresses scene coordinates, point clouds corresponding to different query images can easily be merged into a single large-scale 3D reconstruction. We illustrate this in this video, where we collected the point clouds predicted for each query image of the Aachen-Day dataset, removed low-confidence 3D points, and simply concatenated all 3D point clouds together to obtain a 3D reconstruction of Aachen.
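
The merging procedure described above can be sketched in a few lines of NumPy. This is only an illustration: the function names, the `(image, coords_map, conf_map)` tuple layout, and the confidence threshold of 0.5 are assumptions made for the example and are not taken from the SACReg code.

```python
import numpy as np

def lift_to_point_cloud(image, coords_map, conf_map, conf_thresh=0.5):
    """Lift one query image to a colored point cloud.

    image:      (H, W, 3) RGB colors
    coords_map: (H, W, 3) predicted absolute 3D coordinates
    conf_map:   (H, W)    predicted confidence
    Pixels at or below the (assumed) confidence threshold are dropped.
    """
    keep = conf_map > conf_thresh
    return coords_map[keep], image[keep]

def merge_point_clouds(predictions, conf_thresh=0.5):
    """Concatenate per-image point clouds into one reconstruction.

    No alignment step is needed: every coordinate map already lives in the
    same world frame, so the clouds can simply be stacked.
    """
    points, colors = [], []
    for image, coords_map, conf_map in predictions:
        p, c = lift_to_point_cloud(image, coords_map, conf_map, conf_thresh)
        points.append(p)
        colors.append(c)
    return np.concatenate(points), np.concatenate(colors)
```

The absence of any registration or scale-alignment step is the point of the example: it is a direct consequence of regressing absolute scene coordinates rather than per-image depth.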


title={{SACReg: Scene-Agnostic Coordinate Regression for Visual Localization}}, 
author={Revaud, J\'er\^ome and Cabon, Yohann and Br\'egier, Romain and Lee, JongMin and Weinzaepfel, Philippe},