Announcing the 2020 Image Matching Benchmark and Challenge

April 2, 2020

Posted by Eduard Trulls, Research Scientist, Google Maps



Reconstructing 3D objects and buildings from a series of images is a well-known problem in computer vision, known as Structure-from-Motion (SfM). It has diverse applications in photography and cultural heritage preservation (e.g., allowing people to explore the sculptures of Rapa Nui in a browser) and powers many services across Google Maps, such as the 3D models created from StreetView and aerial imagery. In these examples, images are usually captured by operators under controlled conditions. While this ensures homogeneous data with a uniform, high-quality appearance in the images and the final reconstruction, it also limits the diversity of sites captured and the viewpoints from which they are seen. What if, instead of using images from tightly controlled conditions, one could apply SfM techniques to better capture the richness of the world using the vast amounts of unstructured image collections freely available on the internet?

In order to accelerate research into this topic and better leverage the volume of data already publicly available, we present “Image Matching across Wide Baselines: From Paper to Practice”, a collaboration with UVIC, CTU and EPFL that introduces a new public benchmark for evaluating 3D reconstruction methods. Following the results of the first Image Matching: Local Features and Beyond workshop held at CVPR 2019, this project now includes more than 25k images, each with accurate pose information (location and orientation). This data is publicly available, along with the open-sourced benchmark, and is the foundation of the 2020 Image Matching Challenge to be held at CVPR 2020¹.

Recovering 3D Structure In the Wild
Google Maps already uses images donated by users to inform visitors about popular locations or to update business hours. However, using this type of data to build 3D models is much more difficult, since donated photos have a wide variety of viewpoints, lighting and weather conditions, occlusions from people and vehicles, and the occasional user-applied filters. The examples below highlight the diversity of images for the Trevi Fountain in Rome.
Some example images sampled from the Image Matching Challenge dataset, showing different perspectives of the Trevi Fountain.
In general, reconstructing a 3D scene with SfM starts by identifying which parts of the images capture the same physical points of a scene, such as the corners of a window. This is achieved using local features, i.e., salient locations in an image that can be reliably identified across different views. Each local feature comes with a short descriptor vector that captures the appearance around the point of interest. By comparing these descriptors, one can establish likely correspondences between the pixel coordinates of image locations across two or more images and recover the 3D location of the point by triangulation. Both the poses from which the images were captured and the 3D locations of the physical points observed (for example, where the corner of the window is relative to the camera location) can then be jointly estimated. Doing this over many images and points yields very detailed reconstructions.
A 3D reconstruction generated from over 3000 images, including those from the previous figure.
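To make this concrete, here is a minimal sketch of the matching-and-triangulation pipeline described above, written with OpenCV's SIFT features. The image paths and the intrinsics matrix K are placeholders, and a real SfM system adds many refinements (many views, bundle adjustment, loop closure) that are omitted here.

```python
import cv2
import numpy as np

# Load two overlapping views (placeholder paths).
img1 = cv2.imread("view_a.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view_b.jpg", cv2.IMREAD_GRAYSCALE)

# 1. Detect local features and compute a descriptor for each one.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2. Match descriptors between the two images and keep only
#    distinctive matches (Lowe's ratio test).
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = []
for pair in matcher.knnMatch(des1, des2, k=2):
    if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
        good.append(pair[0])

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# 3. Estimate the relative camera pose from the correspondences.
#    K is an assumed pinhole intrinsics matrix (focal length f, principal point cx, cy).
f, cx, cy = 1200.0, 960.0, 540.0
K = np.array([[f, 0, cx], [0, f, cy], [0, 0, 1]])
E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                            prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

# 4. Triangulate the 3D location of each matched point.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # first camera at the origin
P2 = K @ np.hstack([R, t])                           # second camera
X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)  # 4xN homogeneous points
X = (X_h[:3] / X_h[3]).T                             # Nx3 points in 3D
```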
The challenge for this approach is the risk of incorrect correspondences, caused, for example, by repeated structure, such as the windows of the building, which may look very similar to each other, or by transient elements that do not persist across images, such as the crowds admiring the Trevi Fountain. One way to filter these out is by reasoning about the geometric relations between correspondences across multiple images, as sketched below. An additional, even more powerful approach is to design better methods for identifying and isolating local features, for instance, by ignoring points on transient elements such as people. But to better understand the shortcomings of existing local feature algorithms for SfM and to gain insight into promising directions for future research, a reliable benchmark for measuring performance is necessary.
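One common form of such reasoning is geometric verification: correspondences between two views must be consistent with a single epipolar geometry, so matches on repeated or transient structure can be rejected as outliers with RANSAC. The sketch below illustrates this idea with OpenCV; the function name and threshold are illustrative choices, not part of the benchmark itself.

```python
import cv2
import numpy as np

def filter_matches(pts1, pts2, ransac_thresh=1.0):
    """Keep only correspondences consistent with a single two-view geometry.

    pts1, pts2: Nx2 arrays of matched pixel coordinates in two images.
    Matches on repeated structure or transient objects tend to violate the
    estimated epipolar geometry and are discarded as RANSAC outliers.
    """
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC,
                                     ransac_thresh, 0.999)
    mask = mask.ravel().astype(bool)
    return pts1[mask], pts2[mask]
```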

A Benchmark for Evaluating Local Features for 3D Reconstruction
Local features power many Google services, such as Image Search and product recognition in Google Lens, and are also used in mixed reality applications, like Google Maps' Live View, which relies on traditional, handcrafted local features. Designing better algorithms to identify and describe local features will lead to better performance overall.

Comparing the performance of local feature algorithms, however, has been difficult, because it is not obvious how to collect "ground-truth" data for this purpose. Some computer vision tasks rely on crowdsourcing: Google's OpenImages dataset labels "objects" with bounding boxes or pixel masks by combining machine learning techniques with human annotators. This is not possible in our case, because it is not known a priori what constitutes a "good" local feature, which makes labelling infeasible. Additionally, existing benchmarks, such as HPatches, are often small or limited to a narrow range of transformations, which can bias the evaluation.

What matters is the quality of the reconstruction, and benchmarks should reflect real-world scale and challenges in order to highlight opportunities for developing new approaches. To this end, we have created the Image Matching Benchmark, the first benchmark to include a large dataset of images for training and evaluation. The dataset includes more than 25k images (sourced from the public YFCC100m dataset), each of which has been augmented with accurate pose information (location and orientation). We obtain this "pseudo" ground truth from large-scale SfM (hundreds to thousands of images per scene), which provides accurate and stable poses, and then run our evaluation on much smaller subsets (tens of images), a far more difficult problem. This approach does not require expensive sensors or human labelling, and it provides better proxy metrics than previous benchmarks, which were restricted to small and homogeneous datasets.
Visualizations from our benchmark. We show point-to-point matches generated by different local feature algorithms. Left to right: SIFT, HardNet, LogPolarDesc, R2D2. For details, please refer to our website.
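For reference, the benchmark ultimately scores methods by how well the camera poses recovered from small subsets agree with the pseudo ground truth. The snippet below shows one standard way to express that agreement as angular errors in rotation and translation direction; it is an illustrative simplification, and the exact evaluation protocol is described on the benchmark website.

```python
import numpy as np

def rotation_error_deg(R_gt, R_est):
    """Angular difference (degrees) between ground-truth and estimated rotations."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error_deg(t_gt, t_est):
    """Angle (degrees) between ground-truth and estimated translation directions.
    Scale is unobservable from images alone, so only the direction is compared."""
    cos = np.dot(t_gt, t_est) / (np.linalg.norm(t_gt) * np.linalg.norm(t_est))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```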
We hope this benchmark, dataset and challenge help advance the state of the art in 3D reconstruction with heterogeneous images. If you’re interested in participating in the challenge, please see the 2020 Image Matching Challenge website for more details.

Acknowledgements
The benchmark is joint work by Yuhe Jin and Kwang Moo Yi (University of Victoria), Anastasiia Mishchuk and Pascal Fua (EPFL), Dmytro Mishkin and Jiří Matas (Czech Technical University), and Eduard Trulls (Google). The CVPR workshop is co-organized by Vassileios Balntas (Scape Technologies/Facebook), Vincent Lepetit (Ecole des Ponts ParisTech), Dmytro Mishkin and Jiří Matas (Czech Technical University), Johannes Schönberger (Microsoft), Eduard Trulls (Google), and Kwang Moo Yi (University of Victoria).

¹ Please note that, as of April 2, 2020, CVPR remains on track despite the COVID-19 pandemic. Challenge information will be updated as the situation develops. Please see the 2020 Image Matching Challenge website for details.