Geometric models like DUSt3R have driven significant advances in recovering the geometry of a scene from pairs of photos. However, they fail when the inputs come from vastly different viewpoints (e.g., aerial vs. ground) or modalities (e.g., photos vs. abstract drawings) than those observed during training. This paper addresses a challenging instance of this problem: predicting correspondences between ground-level photos and floor plans. Current datasets for joint photo-floor plan reasoning are limited: they either lack varying modalities (VIGOR) or lack correspondences (WAFFLE). To address these limitations, we introduce a new dataset, C3, created by first reconstructing a number of scenes in 3D from Internet photo collections via structure-from-motion, then manually registering the reconstructions to floor plans gathered from the Internet, from which we derive correspondences between images and floor plans. C3 contains 90K paired floor plans and photos across 597 scenes, with 153M pixel-level correspondences and 85K camera poses. We find that state-of-the-art correspondence models struggle on this task. By training on our new data, we improve on the best-performing method by 34% in RMSE. At the same time, we identify open challenges in cross-modal geometric reasoning that our dataset aims to help address.
Our goal is to create a dataset consisting of paired floor plans and photos, along with annotated correspondences between them. We achieve this through the following steps: (1) reconstruct each scene in 3D from an Internet photo collection via structure-from-motion; (2) manually register the reconstruction to a floor plan gathered from the Internet; and (3) derive pixel-level correspondences between photos and the floor plan from the registered reconstruction (sketched below).
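For concreteness, the following is a minimal sketch (in NumPy) of how such correspondences could be derived once a reconstruction is registered to a floor plan: each SfM point is mapped into the floor plan frame by the registration transform and projected down onto the plan, and every photo observation of that point then yields a photo-to-plan pixel pair. The function and argument names, and the representation of the registration as a 3D similarity transform with y as the vertical axis, are illustrative assumptions rather than the exact pipeline used.

```python
import numpy as np

def derive_correspondences(points_3d, observations, sim_transform):
    """Derive photo <-> floor plan pixel correspondences from a registered SfM model.

    points_3d:     (N, 3) array of 3D points in the reconstruction frame.
    observations:  list of (point_idx, image_id, (u, v)) SfM track observations.
    sim_transform: dict with 's' (scale), 'R' (3x3 rotation), 't' (3,) translation
                   mapping reconstruction coordinates into the floor plan frame,
                   where the y-axis is the vertical (up) direction.
    Returns a list of (image_id, photo_pixel, plan_pixel) tuples.
    """
    s, R, t = sim_transform["s"], sim_transform["R"], sim_transform["t"]
    # Map every 3D point into the floor plan's coordinate frame.
    pts_plan = s * (points_3d @ R.T) + t              # (N, 3)
    # Orthographic projection onto the plan: drop the vertical (y) coordinate.
    plan_xy = pts_plan[:, [0, 2]]                     # (N, 2)

    correspondences = []
    for point_idx, image_id, photo_pixel in observations:
        correspondences.append((image_id, photo_pixel, tuple(plan_xy[point_idx])))
    return correspondences
```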
We evaluate a combination of sparse, semi-dense, and dense matching algorithms on our dataset: SuperGlue, LoFTR, DINOv2, DIFT, RoMa, and MASt3R, as well as DUSt3R. While these correspondence baselines perform rather poorly, we observe promising geometric structure in DUSt3R's outputs: the model only needs to learn a 2D translation, rotation, and scale to align its predictions to the floor plan (illustrated by the alignment sketch below). We therefore leverage the strong geometric prior of the pretrained DUSt3R model and finetune it on our dataset.
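To illustrate what learning only a 2D translation, rotation, and scale amounts to, here is a minimal sketch of a standard Umeyama-style least-squares fit of a 2D similarity transform between projected points and their floor plan counterparts. This is an illustrative alignment routine under that assumption, not the procedure used in the paper.

```python
import numpy as np

def fit_2d_similarity(src, dst):
    """Least-squares similarity (scale s, rotation R, translation t) such that
    dst ≈ s * R @ src + t, for corresponding 2D point sets of shape (N, 2).
    Standard Umeyama/Procrustes solution."""
    mu_src, mu_dst = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Guard against a reflection in the recovered rotation.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    s = np.trace(D @ np.diag(S)) / src_c.var(0).sum()
    t = mu_dst - s * R @ mu_src
    return s, R, t
```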
We use the DUSt3R framework with the floor plan as the reference image, which defines the 3D coordinate frame, and a photo as the second input. DUSt3R then generates a pointmap that associates each pixel in the photo with a 3D point (x, y, z) in the coordinate system of the floor plan. To project onto the floor plan, we apply an orthographic projection that drops the y-coordinate, i.e., (x, y, z) → (x, z), since the y-axis represents the vertical (up) direction in the floor plan's coordinate frame.
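As a concrete illustration of this projection step, the snippet below (with illustrative names) takes a pointmap expressed in the floor plan's coordinate frame and drops the y-coordinate of every pixel's 3D point to obtain 2D floor plan coordinates.

```python
import numpy as np

def pointmap_to_plan_coords(pointmap):
    """Orthographically project a DUSt3R-style pointmap onto the floor plan.

    pointmap: (H, W, 3) array; each photo pixel carries a 3D point (x, y, z)
              expressed in the floor plan's coordinate frame, where y is the
              vertical (up) direction.
    Returns an (H, W, 2) array of 2D floor plan coordinates (x, z),
    i.e., the orthographic projection that drops the y-coordinate.
    """
    return pointmap[..., [0, 2]]
```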
We find that correct correspondence predictions by C3Po are generally accompanied by high confidence scores, while incorrect predictions tend to have low confidence scores. Photos with low-confidence results usually exhibit ambiguity, like the cases identified in Open Challenges, whereas the camera pose of photos with high-confidence results is easier to identify. This is further corroborated by the precision-recall curves, which show that C3Po significantly outperforms state-of-the-art models: its more confident correspondence predictions are more likely to be correct.
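For reference, here is a minimal sketch of how such precision-recall curves can be computed by sweeping a confidence threshold over the predicted correspondences. The correctness criterion (a fixed pixel-error threshold on the floor plan) and the parameter names are assumptions for illustration, not the paper's exact evaluation protocol.

```python
import numpy as np

def pr_curve_over_confidence(confidences, errors, pix_thresh=10.0, num_steps=100):
    """Precision-recall curve obtained by sweeping a confidence threshold.

    confidences: (N,) per-correspondence confidence scores.
    errors:      (N,) floor plan distance (in pixels) to the ground-truth point.
    A prediction counts as correct if its error is below pix_thresh.
    """
    correct = errors < pix_thresh
    precisions, recalls = [], []
    for tau in np.linspace(confidences.min(), confidences.max(), num_steps):
        kept = confidences >= tau
        if kept.sum() == 0:
            continue
        precisions.append(correct[kept].mean())               # precision among kept predictions
        recalls.append((correct & kept).sum() / len(errors))  # fraction of all pairs kept and correct
    return np.array(precisions), np.array(recalls)
```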
We highlight two categories of data that our model struggles with. Challenge 1 (top two rows) consists of cases where the photo provides minimal contextual clues about where it could lie on the floor plan. Challenge 2 (bottom two rows) consists of scenes that exhibit structural symmetry, where multiple correspondence alignments would seem plausible. In all cases, our model makes plausible predictions, but the answers are wrong due to lack of context or structural ambiguity.
@inproceedings{huang2025c3po,
title={C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction},
author={Huang, Kuan Wei and Li, Brandon and Hariharan, Bharath and Snavely, Noah},
booktitle={Advances in Neural Information Processing Systems},
volume={38},
year={2025}
}