Geometric models like DUSt3R have driven significant advances in recovering the geometry of a scene from pairs of photos. However, they fail when the inputs come from vastly different viewpoints (e.g., aerial vs. ground) or modalities (e.g., photos vs. abstract drawings) than those observed during training. This paper addresses a challenging instance of this problem: predicting correspondences between ground-level photos and floor plans. Current datasets for joint photo-floor plan reasoning are limited: they either lack varying modalities (VIGOR) or lack correspondences (WAFFLE). To address these limitations, we introduce a new dataset, C3, created by first reconstructing a number of scenes in 3D from Internet photo collections via structure-from-motion, then manually registering the reconstructions to floor plans gathered from the Internet, from which we derive correspondences between images and floor plans. C3 contains 90K paired floor plans and photos across 597 scenes, with 153M pixel-level correspondences and 85K camera poses. We find that state-of-the-art correspondence models struggle on this task. By training on our new data, we improve on the best-performing method by 34% in RMSE. At the same time, we identify open challenges in cross-modal geometric reasoning that our dataset aims to help address.
Our goal is to create a dataset consisting of paired floor plans and photos, along with annotated correspondences between them. We achieve this through the following steps: (1) reconstruct each scene in 3D from an Internet photo collection via structure-from-motion; (2) manually register the reconstruction to a floor plan gathered from the Internet; and (3) derive pixel-level correspondences between photos and the floor plan from the registered reconstruction (sketched below).
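For concreteness, the following is a minimal sketch (in NumPy) of how such correspondences could be derived once a reconstruction is registered to a floor plan: each SfM point is mapped into the floor plan frame by the registration transform and projected down onto the plan, and every photo observation of that point then yields a photo-to-plan pixel pair. The function and argument names, and the representation of the registration as a 3D similarity transform with y as the vertical axis, are illustrative assumptions rather than the exact pipeline used.

```python
import numpy as np

def derive_correspondences(points_3d, observations, sim_transform):
    """Derive photo <-> floor plan pixel correspondences from a registered SfM model.

    points_3d:     (N, 3) array of 3D points in the reconstruction frame.
    observations:  list of (point_idx, image_id, (u, v)) SfM track observations.
    sim_transform: dict with 's' (scale), 'R' (3x3 rotation), 't' (3,) translation
                   mapping reconstruction coordinates into the floor plan frame,
                   where the y-axis is the vertical (up) direction.
    Returns a list of (image_id, photo_pixel, plan_pixel) tuples.
    """
    s, R, t = sim_transform["s"], sim_transform["R"], sim_transform["t"]
    # Map every 3D point into the floor plan's coordinate frame.
    pts_plan = s * (points_3d @ R.T) + t              # (N, 3)
    # Orthographic projection onto the plan: drop the vertical (y) coordinate.
    plan_xy = pts_plan[:, [0, 2]]                     # (N, 2)

    correspondences = []
    for point_idx, image_id, photo_pixel in observations:
        correspondences.append((image_id, photo_pixel, tuple(plan_xy[point_idx])))
    return correspondences
```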
We evaluate a combination of sparse, semi-dense, and dense matching algorithms on our dataset: SuperGlue, LoFTR, DINOv2, DIFT, RoMa, and MASt3R, as well as DUSt3R. While these correspondence baselines perform rather poorly, we observe promising geometric structure in DUSt3R's outputs: the model only needs to learn a 2D translation, rotation, and scale to align its predictions to the floor plan (illustrated by the alignment sketch below). We therefore leverage the strong geometric prior of the pretrained DUSt3R model and finetune it on our dataset.
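To illustrate what learning only a 2D translation, rotation, and scale amounts to, here is a minimal sketch of a standard Umeyama-style least-squares fit of a 2D similarity transform between projected points and their floor plan counterparts. This is an illustrative alignment routine under that assumption, not the procedure used in the paper.

```python
import numpy as np

def fit_2d_similarity(src, dst):
    """Least-squares similarity (scale s, rotation R, translation t) such that
    dst ≈ s * R @ src + t, for corresponding 2D point sets of shape (N, 2).
    Standard Umeyama/Procrustes solution."""
    mu_src, mu_dst = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Guard against a reflection in the recovered rotation.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    s = np.trace(D @ np.diag(S)) / src_c.var(0).sum()
    t = mu_dst - s * R @ mu_src
    return s, R, t
```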
We use the DUSt3R framework with the floor plan as the reference image, which defines the 3D coordinate frame, and a photo as the second input. DUSt3R then generates a pointmap that associates each pixel in the photo with a 3D point (x, y, z) in the coordinate system of the floor plan. To project onto the floor plan, we apply an orthographic projection that drops the y-coordinate, i.e., (x, y, z) → (x, z), since the y-axis represents the vertical (up) direction in the floor plan's coordinate frame.
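As a concrete illustration of this projection step, the snippet below (with illustrative names) takes a pointmap expressed in the floor plan's coordinate frame and drops the y-coordinate of every pixel's 3D point to obtain 2D floor plan coordinates.

```python
import numpy as np

def pointmap_to_plan_coords(pointmap):
    """Orthographically project a DUSt3R-style pointmap onto the floor plan.

    pointmap: (H, W, 3) array; each photo pixel carries a 3D point (x, y, z)
              expressed in the floor plan's coordinate frame, where y is the
              vertical (up) direction.
    Returns an (H, W, 2) array of 2D floor plan coordinates (x, z),
    i.e., the orthographic projection that drops the y-coordinate.
    """
    return pointmap[..., [0, 2]]
```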
We find that correct correspondence predictions by C3Po are generally accompanied by high confidence scores, while incorrect predictions tend to have low confidence scores. Photos with low-confidence results usually exhibit ambiguity, like the cases identified in Open Challenges, whereas the camera pose of photos with high-confidence results is easier to identify. This is further corroborated by the precision-recall curves, which show that C3Po significantly outperforms state-of-the-art models: its more confident correspondence predictions are more likely to be correct.
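For reference, here is a minimal sketch of how such precision-recall curves can be computed by sweeping a confidence threshold over the predicted correspondences. The correctness criterion (a fixed pixel-error threshold on the floor plan) and the parameter names are assumptions for illustration, not the paper's exact evaluation protocol.

```python
import numpy as np

def pr_curve_over_confidence(confidences, errors, pix_thresh=10.0, num_steps=100):
    """Precision-recall curve obtained by sweeping a confidence threshold.

    confidences: (N,) per-correspondence confidence scores.
    errors:      (N,) floor plan distance (in pixels) to the ground-truth point.
    A prediction counts as correct if its error is below pix_thresh.
    """
    correct = errors < pix_thresh
    precisions, recalls = [], []
    for tau in np.linspace(confidences.min(), confidences.max(), num_steps):
        kept = confidences >= tau
        if kept.sum() == 0:
            continue
        precisions.append(correct[kept].mean())               # precision among kept predictions
        recalls.append((correct & kept).sum() / len(errors))  # fraction of all pairs kept and correct
    return np.array(precisions), np.array(recalls)
```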
We highlight two categories of data that our model struggles with. Challenge 1 (top two rows) consists of cases where the photo provides minimal contextual clues about where it could lie on the floor plan. Challenge 2 (bottom two rows) consists of scenes that exhibit structural symmetry, where multiple correspondence alignments would seem plausible. In all cases, our model makes plausible predictions, but the answers are wrong due to lack of context or structural ambiguity.
@inproceedings{huang2025c3po,
title={C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction},
author={Huang, Kuan Wei and Li, Brandon and Hariharan, Bharath and Snavely, Noah},
booktitle={Advances in Neural Information Processing Systems},
volume={38},
year={2025}
}