In efforts to learn about the quantum world, scientists face a big obstacle: their classical experience of the world. Whenever a quantum system is measured, the act of this measurement destroys the “quantumness” of the state. For example, if the quantum state is in a superposition of two locations, where it can seem to be in two places at the same time, once it is measured, it will randomly appear either ”here” or “there”, but not both. We only ever see the classical shadows cast by this strange quantum world.
A growing number of experiments are implementing machine learning (ML) algorithms to aid in analyzing data, but these have the same limitations as the people they aim to help: They can’t directly access and learn from quantum information. But what if there were a quantum machine learning algorithm that could directly interact with this quantum data?
In “Quantum Advantage in Learning from Experiments”, a collaboration with researchers at Caltech, Harvard, Berkeley, and Microsoft published in Science, we show that a quantum learning agent can perform exponentially better than a classical learning agent at many tasks. Using Google’s quantum computer, Sycamore, we demonstrate the tremendous advantage that a quantum machine learning (QML) algorithm has over the best possible classical algorithm. Unlike previous quantum advantage demonstrations, no advances in classical computing power could overcome this gap. This is the first demonstration of a provable exponential advantage in learning about quantum systems that is robust even on today's noisy hardware.
Quantum Speedup QML combines the best of both quantum computing and the lesser-known field of quantum sensing.
Quantum computers will likely offer exponential improvements over classical systems for certain problems, but to realize their potential, researchers first need to scale up the number of qubits and to improve quantum error correction. What’s more, the exponential speed-up over classical algorithms promised by quantum computers relies on a big, unproven assumption about so-called “complexity classes” of problems — namely, that the class of problems that can be solved on a quantum computer is larger than those that can be solved on a classical computer.. It seems like a reasonable assumption, and yet, no one has proven it. Until it's proven, every claim of quantum advantage will come with an asterisk: that it can do better than any known classical algorithm.
Quantum sensors, on the other hand, are already being used for some high-precision measurements and offer modest (and proven) advantages over classical sensors. Some quantum sensors work by exploiting quantum correlations between particles to extract more information about a system than it otherwise could have. For example, scientists can use a collection of N atoms to measure aspects of the atoms’ environment like the surrounding magnetic fields. Typically the sensitivity to the field that the atoms can measure scales with the square root of N. But if one uses quantum entanglement to create a complex web of correlations between the atoms, then one can improve the scaling to be proportional to N. But as with most quantum sensing protocols, this quadratic speed-up over classical sensors is the best one can ever do.
Enter QML, a technology that straddles the line between quantum computers and quantum sensors. QML algorithms make computations that are aided by quantum data. Instead of measuring the quantum state, a quantum computer can store quantum data and implement a QML algorithm to process the data without collapsing it. And when this data is limited, a QML algorithm can squeeze exponentially more information out of each piece it receives when considering particular tasks.
To see how a QML algorithm works, it’s useful to contrast with a standard quantum experiment. If a scientist wants to learn about a quantum system, they might send in a quantum probe, such as an atom or other quantum object whose state is sensitive to the system of interest, let it interact with the system, then measure the probe. They can then design new experiments or make predictions based on the outcome of the measurements. Classical machine learning (CML) algorithms follow the same process using an ML model, but the operating principle is the same — it’s a classical device processing classical information.
A QML algorithm instead uses an artificial “quantum learner.” After the quantum learner sends in a probe to interact with the system, it can choose to store the quantum state rather than measure it. Herein lies the power of QML. It can collect multiple copies of these quantum probes, then entangle them to learn more about the system faster.
Suppose, for example, the system of interest produces a quantum superposition state probabilistically by sampling from some distribution of possible states. Each state is composed of n quantum bits, or qubits, where each is a superposition of “0” and “1” — all learners are allowed to know the generic form of the state, but must learn its details.
In a standard experiment, where only classical data is accessible, every measurement provides a snapshot of the distribution of quantum states, but since it’s only a sample, it is necessary to measure many copies of the state to reconstruct it. In fact, it will take on the order of 2n copies.
A QML agent is more clever. By saving a copy of the n-qubit state, then entangling it with the next copy that comes along, it can learn about the global quantum state more quickly, giving a better idea of what the state looks like sooner.
The classical reconstruction is like trying to find an image hiding in a sea of noisy pixels — it could take a very long time to average-out all the noise to know what the image is representing. The quantum reconstruction, on the other hand, uses quantum mechanics to isolate the true image faster by looking for correlations between two different images at once.
Results To better understand the power of QML, we first looked at three different learning tasks and theoretically proved that in each case, the quantum learning agent would do exponentially better than the classical learning agent. Each task was related to the example given above:
In addition to the theoretical work, we ran some proof-of-principle experiments on the Sycamore quantum processor. We started by implementing a QML algorithm to perform the first task. We fed an unknown quantum mixed state to the algorithm, then asked which of two observables of the state was larger. After training the neural network with simulation data, we found that the quantum learning agent needed exponentially fewer experiments to reach a prediction accuracy of 70% — equating to 10,000 times fewer measurements when the system size was 20 qubits. The total number of qubits used was 40 since two copies were stored at once.
In a second experiment, relating to the task 3 above, we had the algorithm learn about the symmetry of an operator that evolves the quantum state of their qubits. In particular, if a quantum state might undergo evolution that is either totally random or random but also time-reversal symmetric, it can be difficult for a classical learner to tell the difference. In this task, the QML algorithm can separate the operators into two distinct categories, representing two different symmetry classes, while the CML algorithm fails outright. The QML algorithm was completely unsupervised, so this gives us hope that the approach could be used to discover new phenomena without needing to know the right answer beforehand.
Conclusion This experimental work represents the first demonstrated exponential advantage in quantum machine learning. And, distinct from a computational advantage, when limiting the number of samples from the quantum state, this type of quantum learning advantage cannot be challenged, even by unlimited classical computing resources.
So far, the technique has only been used in a contrived, “proof-of-principle” experiment, where the quantum state is deliberately produced and the researchers pretend not to know what it is. To use these techniques to make quantum-enhanced measurements in a real experiment, we’ll first need to work on current quantum sensor technology and methods to faithfully transfer quantum states to a quantum computer. But the fact that today’s quantum computers can already process this information to squeeze out an exponential advantage in learning bodes well for the future of quantum machine learning.
AcknowledgementsWe would like to thank our Quantum Science Communicator Katherine McCormick for writing this blog post. Images reprinted with permission from Huang et al., Science, Vol 376:1182 (2022).
Over four billion people live in cities around the globe, and while most people interact daily with others — at the grocery store, on public transit, at work — they may take for granted their frequent interactions with the diverse plants and animals that comprise fragile urban ecosystems. Trees in cities, called urban forests, provide critical benefits for public health and wellbeing and will prove integral to urban climate adaptation. They filter air and water, capture stormwater runoff, sequester atmospheric carbon dioxide, and limit erosion and drought. Shade from urban trees reduces energy-expensive cooling costs and mitigates urban heat islands. In the US alone, urban forests cover 127M acres and produce ecosystem services valued at $18 billion. But as the climate changes these ecosystems are increasingly under threat.
Urban forest monitoring — measuring the size, health, and species distribution of trees in cities over time — allows researchers and policymakers to (1) quantify ecosystem services, including air quality improvement, carbon sequestration, and benefits to public health; (2) track damage from extreme weather events; and (3) target planting to improve robustness to climate change, disease and infestation.
However, many cities lack even basic data about the location and species of their trees. Collecting such data via a tree census is costly (a recent Los Angeles census cost $2 million and took 18 months) and thus is typically conducted only by cities with substantial resources. Further, lack of access to urban greenery is a key aspect of urban social inequality, including socioeconomic and racial inequality. Urban forest monitoring enables the quantification of this inequality and the pursuit of its improvement, a key aspect of the environmental justice movement. But machine learning could dramatically lower tree census costs using a combination of street-level and aerial imagery. Such an automated system could democratize access to urban forest monitoring, especially for under-resourced cities that are already disproportionately affected by climate change. While there have been prior efforts to develop automated urban tree species recognition from aerial or street-level imagery, a major limitation has been a lack of large-scale labeled datasets.
Today we introduce the Auto Arborist Dataset, a multiview urban tree classification dataset that, at ~2.6 million trees and >320 genera, is two orders of magnitude larger than those in prior work. To build the dataset, we pulled from public tree censuses from 23 North American cities (shown above) and merged these records with Street View and overhead RGB imagery. As the first urban forest dataset to cover multiple cities, we analyze in detail how forest models can generalize with respect to geographic distribution shifts, crucial to building systems that scale. We are releasing all 2.6M tree records publicly, along with aerial and ground-level imagery for 1M trees.
The Auto Arborist Dataset To curate Auto Arborist, we started from existing tree censuses which are provided by many cities online. For each tree census considered, we verified that the data contained GPS locations and genus/species labels, and was available for public use. We then parsed these data into a common format, fixing common data entry errors (such as flipped latitude/longitude) and mapping ground-truth genus names (and their common misspellings or alternate names) to a unified taxonomy. We have chosen to focus on genus prediction (instead of species-level prediction) as our primary task to avoid taxonomic complexity arising from hybrid and subspecies and the fact that there is more universal consensus on genus names than species names.
Next, using the provided geolocation for each tree, we queried an RGB aerial image centered on the tree and all street-level images taken within 2-10 meters around it. Finally, we filtered these images to (1) maximize our chances that the tree of interest is visible from each image and (2) preserve user privacy. This latter concern involved a number of steps including the removal of images that included people as determined by semantic segmentation and manual blurring, among others.
One of the most important challenges for urban forest monitoring is to do well in cities that were not part of the training set. Vision models must contend with distribution shifts, where the training distribution differs from the test distribution from a new city. Genus distributions vary geographically (e.g., there are more Douglas fir in western Canada than in California) and can also vary based on city size (LA is much larger than Santa Monica and contains many more genera). Another challenge is the long-tailed, fine-grained nature of tree genera, which can be difficult to disambiguate even for human experts, with many genera being quite rare.
Finally, there are a number of ways in which tree images can have noise. For one, there is temporal variation in deciduous trees (for example, when aerial imagery includes leaves, but street-level images are bare). Moreover, public arboreal censuses are not always up-to-date. Thus, sometimes trees have died (and are no longer visible) in the time since the tree census was taken. In addition, aerial data quality can be poor (missing or obscured, e.g., by clouds).
Our curation process sought to minimize these issues by (1) only keeping images with sufficient tree pixels, as determined by a semantic segmentation model, (2) only keeping reasonably recent images, and (3) only keeping images where the tree position was sufficiently close to the street level camera. We considered also optimizing for trees seen in spring and summer, but decided seasonal variation could be a useful cue — we thus also released the date of each image to enable the community to explore the effects of seasonal variability.
Benchmark and Evaluation To evaluate the dataset, we designed a benchmark to measure domain generalization and performance in the long tail of the distribution. We generated training and test splits at three levels. First, we split within each city (based on latitude or longitude) to see how well a city generalizes to itself. Second, we aggregate city-level training sets into three regions, West, Central, and East, holding out one city from each region. Finally, we merge the training sets across the three regions. For each of these splits, we report both accuracy and class-averaged recall for frequent, common and rare species on the corresponding held-out test sets.
Using these metrics, we establish a performance baseline using standard modern convolutional models (ResNet). Our results demonstrate the benefits of a large-scale, geospatially distributed dataset such as Auto Arborist. First, we see that more training data helps — training on the entire dataset is better than training on a region, which is better than training on a single city.
Second, training on similar cities helps (and thus, having more coverage of cities helps). For example, if focusing on Seattle, then it is better to train on trees in Vancouver than Pittsburgh.
Third, more data modalities and views help. The best performing models combine inputs from multiple Street View angles and overhead views. There remains much room for improvement, however, and this is where we believe the larger community of researchers can help.
Get Involved By releasing the Auto Arborist Dataset, we step closer to the goal of affordable urban forest monitoring, enabling the computer vision community to tackle urban forest monitoring at scale for the first time. In the future, we hope to expand coverage to more North American cities (particularly in the South of the US and Mexico) and even worldwide. Further, we are excited to push the dataset to the more fine-grained species level and investigate more nuanced monitoring, including monitoring tree health and growth over time, and studying the effects of environmental factors on urban forests.
For more details, see our CVPR 2022 paper. This dataset is part of Google's broader efforts to empower cities with data about urban forests, through the Environmental Insights Explorer Tree Canopy Lab and is available on our GitHub repo. If you represent a city that is interested in being included in the dataset please email auto-arborist+managers@googlegroups.com.
Acknowledgements We would like to thank our co-authors Guanhang Wu, Trevor Edwards, Filip Pavetic, Bo Majewski, Shreyasee Mukherjee, Stanley Chan, John Morgan, Vivek Rathod, and Chris Bauer. We also thank Ruth Alcantara, Tanya Birch, and Dan Morris from Google AI for Nature and Society, John Quintero, Stafford Marquardt, Xiaoqi Yin, Puneet Lall, and Matt Manolides from Google Geo, Karan Gill, Tom Duerig, Abhijit Kundu, David Ross, Vighnesh Birodkar, Yiwen Luo and Lily Pagan from Google Research (Perception team), and Pietro Perona for their support. This work was supported in part by the Resnick Sustainability Institute and was undertaken while Sara Beery was a Student Researcher at Google.
This week marks the beginning of the premier annual Computer Vision and Pattern Recognition conference (CVPR 2022), held both in-person in New Orleans, LA and virtually. As a leader in computer vision research and a Platinum Sponsor, Google will have a strong presence across CVPR 2022 with over 80 papers being presented at the main conference and active involvement in a number of conference workshops and tutorials.
If you are attending CVPR this year, please stop by our booth and chat with our researchers who are actively exploring the latest machine learning techniques for application to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including on-device ML applications with MediaPipe, the Auto Arborist Dataset for urban forest monitoring, and much more.
You can also learn more about our research being presented at CVPR 2022 in the list below (Google affiliations in bold).
Organizing Committee
Tutorials Chairs Include: Boqing Gong
Website Chairs Include: AJ Piergiovanni
Area Chairs Include: Alireza Fathi, Cordelia Schmid, Deqing Sun, Jonathan Barron, Michael Ryoo, Supasorn Suwajanakorn, Susanna Ricco
Diversity, Equity, and Inclusion Chairs Include: Noah Snavely
Panel Discussion: Embodied Computer Vision Panelists include: Michael Ryoo
Publications
Learning to Prompt for Continual Learning (see blog post) Zifeng Wang*, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, Tomas Pfister
GCR: Gradient Coreset Based Replay Buffer Selection for Continual Learning Rishabh Tiwari, Krishnateja Killamsetty, Rishabh Iyer, Pradeep Shenoy
Zero-Shot Text-Guided Object Generation with Dream Fields Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, Ben Poole
Towards End-to-End Unified Scene Text Detection and Layout Analysis Shangbang Long, Siyang Qin, Dmitry Panteleev, Alessandro Bissacco, Yasuhisa Fujii, Michalis Raptis
FLOAT: Factorized Learning of Object Attributes for Improved Multi-object Multi-part Scene Parsing Rishubh Singh, Pranav Gupta, Pradeep Shenoy, Ravikiran Sarvadevabhatla
LOLNerf: Learn from One Look Daniel Rebain, Mark Matthews, Kwang Moo Yi, Dmitry Lagun, Andrea Tagliasacchi
Photorealistic Monocular 3D Reconstruction of Humans Wearing Clothing Thiemo Alldieck, Mihai Zanfir, Cristian Sminchisescu
Learning Local Displacements for Point Cloud Completion Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari
Density-Preserving Deep Point Cloud Compression Yun He, Xinlin Ren, Danhang Tang, Yinda Zhang, Xiangyang Xue, Yanwei Fu
CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation Qihang Yu*, Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
Deformable Sprites for Unsupervised Video Decomposition Vickie Ye, Zhengqi Li, Richard Tucker, Angjoo Kanazawa, Noah Snavely
Learning with Neighbor Consistency for Noisy Labels Ahmet Iscen, Jack Valmadre, Anurag Arnab, Cordelia Schmid
Multiview Transformers for Video Recognition Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid
Kubric: A Scalable Dataset Generator Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti (Derek) Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan*, Daniel Rebain, Sara Sabour, Mehdi S. M. Sajjadi, Matan Sela, Vincent Sitzmann, Austin Stone, Deqing Sun, Suhani Vora, Ziyu Wang, Tianhao Wu, Kwang Moo Yi, Fangcheng Zhong, Andrea Tagliasacchi
3D Moments from Near-Duplicate Photos Qianqian Wang, Zhengqi Li, David Salesin, Noah Snavely, Brian Curless, Janne Kontkanen
Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, Peter Hedman
RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs Michael Niemeyer*, Jonathan T. Barron, Ben Mildenhall, Mehdi S. M. Sajjadi, Andreas Geiger, Noha Radwan*
Ref-NeRF: Structured View-Dependent Appearance for Neural Radiance Fields Dor Verbin, Peter Hedman, Ben Mildenhall, Todd Zickler, Jonathan T. Barron, Pratul P. Srinivasan
IRON: Inverse Rendering by Optimizing Neural SDFs and Materials from Photometric Images Kai Zhang, Fujun Luan, Zhengqi Li, Noah Snavely
MAXIM: Multi-Axis MLP for Image Processing Zhengzhong Tu*, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li
Restormer: Efficient Transformer for High-Resolution Image Restoration Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang
Burst Image Restoration and Enhancement Akshay Dudhane, Syed Waqas Zamir, Salman Khan, Fahad Shahbaz Khan, Ming-Hsuan Yang
Neural RGB-D Surface Reconstruction Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, Justus Thies
Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations Mehdi S. M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan*, Suhani Vora, Mario Lučić, Daniel Duckworth, Alexey Dosovitskiy*, Jakob Uszkoreit*, Thomas Funkhouser, Andrea Tagliasacchi*
ZebraPose: Coarse to Fine Surface Encoding for 6DoF Object Pose Estimation Yongzhi Su, Mahdi Saleh, Torben Fetzer, Jason Rambach, Nassir Navab, Benjamin Busam, Didier Stricker, Federico Tombari
MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision Ben Usman, Andrea Tagliasacchi, Kate Saenko, Avneesh Sud
GPV-Pose: Category-Level Object Pose Estimation via Geometry-Guided Point-wise Voting Yan Di, Ruida Zhang, Zhiqiang Lou, Fabian Manhardt, Xiangyang Ji, Nassir Navab, Federico Tombari
Rethinking Deep Face Restoration Yang Zhao*, Yu-Chuan Su, Chun-Te Chu, Yandong Li, Marius Renn, Yukun Zhu, Changyou Chen, Xuhui Jia
Transferability Metrics for Selecting Source Model Ensembles Andrea Agostinelli, Jasper Uijlings, Thomas Mensink, Vittorio Ferrari
Robust Fine-Tuning of Zero-Shot Models Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, Ludwig Schmidt
Block-NeRF: Scalable Large Scene Neural View Synthesis Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P. Srinivasan, Jonathan T. Barron, Henrik Kretzschmar
Light Field Neural Rendering Mohammad Suhail*, Carlos Esteves, Leonid Sigal, Ameesh Makadia
Transferability Estimation Using Bhattacharyya Class Separability Michal Pándy, Andrea Agostinelli, Jasper Uijlings, Vittorio Ferrari, Thomas Mensink
Matching Feature Sets for Few-Shot Image Classification Arman Afrasiyabi, Hugo Larochelle, Jean-François Lalonde, Christian Gagné
Which Model to Transfer? Finding the Needle in the Growing Haystack Cedric Renggli, André Susano Pinto, Luka Rimanic, Joan Puigcerver, Carlos Riquelme, Ce Zhang, Mario Lučić
Auditing Privacy Defenses in Federated Learning via Generative Gradient Leakage Zhuohang Li, Jiaxin Zhang, Luyang Liu, Jian Liu
Estimating Example Difficulty Using Variance of Gradients Chirag Agarwal, Daniel D'souza, Sara Hooker
More Than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech (see blog post) Michael Hassid, Michelle Tadmor Ramanovich, Brendan Shillingford, Miaosen Wang, Ye Jia, Tal Remez
Robust Outlier Detection by De-Biasing VAE Likelihoods Kushal Chauhan, Barath Mohan U, Pradeep Shenoy, Manish Gupta, Devarajan Sridharan
Deep 3D-to-2D Watermarking: Embedding Messages in 3D Meshes and Extracting Them from 2D Renderings Innfarn Yoo, Huiwen Chang, Xiyang Luo, Ondrej Stava, Ce Liu*, Peyman Milanfar, Feng Yang
Knowledge Distillation: A Good Teacher Is Patient and Consistent Lucas Beyer, Xiaohua Zhai, Amélie Royer*, Larisa Markeeva*, Rohan Anil, Alexander Kolesnikov
Urban Radiance Fields Konstantinos Rematas, Andrew Liu, Pratul P. Srinivasan, Jonathan T. Barron, Andrea Tagliasacchi, Thomas Funkhouser, Vittorio Ferrari
Manifold Learning Benefits GANs Yao Ni, Piotr Koniusz, Richard Hartley, Richard Nock
MaskGIT: Masked Generative Image Transformer Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu*, William T. Freeman
InOut: Diverse Image Outpainting via GAN Inversion Yen-Chi Cheng, Chieh Hubert Lin, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Ming-Hsuan Yang
Scaling Vision Transformers (see blog post) Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer
Fine-Tuning Image Transformers Using Learnable Memory Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, Andrew Jackson
PokeBNN: A Binary Pursuit of Lightweight Accuracy Yichi Zhang*, Zhiru Zhang, Lukasz Lew
Bending Graphs: Hierarchical Shape Matching Using Gated Optimal Transport Mahdi Saleh, Shun-Cheng Wu, Luca Cosmo, Nassir Navab, Benjamin Busam, Federico Tombari
Uncertainty-Aware Deep Multi-View Photometric Stereo Berk Kaya, Suryansh Kumar, Carlos Oliveira, Vittorio Ferrari, Luc Van Gool
Depth-Supervised NeRF: Fewer Views and Faster Training for Free Kangle Deng, Andrew Liu, Jun-Yan Zhu, Deva Ramanan
Dense Depth Priors for Neural Radiance Fields from Sparse Input Views Barbara Roessle, Jonathan T. Barron, Ben Mildenhall, Pratul P. Srinivasan, Matthias Nießner
Trajectory Optimization for Physics-Based Reconstruction of 3D Human Pose from Monocular Video Erik Gärtner, Mykhaylo Andriluka, Hongyi Xu, Cristian Sminchisescu
Differentiable Dynamics for Articulated 3D Human Motion Reconstruction Erik Gärtner, Mykhaylo Andriluka, Erwin Coumans, Cristian Sminchisescu
Panoptic Neural Fields: A Semantic Object-Aware Neural Scene Representation Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Caroline Pantofaru, Leonidas J. Guibas, Andrea Tagliasacchi, Frank Dellaert, Thomas Funkhouser
Pyramid Adversarial Training Improves ViT Performance Charles Herrmann, Kyle Sargent, Lu Jiang, Ramin Zabih, Huiwen Chang, Ce Liu*, Dilip Krishnan, Deqing Sun
Proper Reuse of Image Classification Features Improves Object Detection Cristina Vasconcelos, Vighnesh Birodkar, Vincent Dumoulin
SOMSI: Spherical Novel View Synthesis with Soft Occlusion Multi-Sphere Images Tewodros Habtegebrial, Christiano Gava, Marcel Rogge, Didier Stricker, Varun Jampani
TubeFormer-DeepLab: Video Mask Transformer Dahun Kim, Jun Xie, Huiyu Wang, Siyuan Qiao, Qihang Yu, Hong-Seok Kim, Hartwig Adam, In So Kweon, Liang-Chieh Chen
Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision Liangzhe Yuan, Rui Qian*, Yin Cui, Boqing Gong, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu
When Does Contrastive Visual Representation Learning Work? Elijah Cole, Xuan Yang, Kimberly Wilber, Oisin Mac Aodha, Serge Belongie
Less Is More: Generating Grounded Navigation Instructions from Landmarks Su Wang, Ceslee Montgomery, Jordi Orbay, Vighnesh Birodkar, Aleksandra Faust, Izzeddin Gur, Natasha Jaques, Austin Waters, Jason Baldridge, Peter Anderson
Forecasting Characteristic 3D Poses of Human Actions Christian Diller, Thomas Funkhouser, Angela Dai
BEHAVE: Dataset and Method for Tracking Human Object Interactions Bharat Lal Bhatnagar, Xianghui Xie, Ilya A. Petrov, Cristian Sminchisescu, Christian Theobalt, Gerard Pons-Moll
Motion-from-Blur: 3D Shape and Motion Estimation of Motion-Blurred Objects in Videos Denys Rozumnyi, Martin R. Oswald, Vittorio Ferrari, Marc Pollefeys
End-to-End Generative Pretraining for Multimodal Video Captioning (see blog post) Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation Jogendra Nath Kundu, Siddharth Seth, Pradyumna YM, Varun Jampani, Anirban Chakraborty, R. Venkatesh Babu
Learning ABCs: Approximate Bijective Correspondence for Isolating Factors of Variation with Weak Supervision Kieran A. Murphy, Varun Jampani, Srikumar Ramalingam, Ameesh Makadia
HumanNeRF: Free-Viewpoint Rendering of Moving People from Monocular Video Chung-Yi Weng, Brian Curless, Pratul P. Srinivasan, Jonathan T. Barron, Ira Kemelmacher-Shlizerman
Deblurring via Stochastic Refinement Jay Whang*, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G. Dimakis, Peyman Milanfar
NeRF in the Dark: High Dynamic Range View Synthesis from Noisy Raw Images Ben Mildenhall, Peter Hedman, Ricardo Martin-Brualla, Pratul P. Srinivasan, Jonathan T. Barron
CoNeRF: Controllable Neural Radiance Fields Kacper Kania, Kwang Moo Yi, Marek Kowalski, Tomasz Trzciński, Andrea Tagliasacchi
A Conservative Approach for Unbiased Learning on Unknown Biases Myeongho Jeon, Daekyung Kim, Woochul Lee, Myungjoo Kang, Joonseok Lee
DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection (see blog post) Yingwei Li*, Adams Wei Yu, Tianjian Meng, Ben Caine, Jiquan Ngiam, Daiyi Peng, Junyang Shen, Yifeng Lu, Denny Zhou, Quoc V. Le, Alan Yuille, Mingxing Tan
Video Frame Interpolation Transformer Zhihao Shi, Xiangyu Xu, Xiaohong Liu, Jun Chen, Ming-Hsuan Yang
Global Matching with Overlapping Attention for Optical Flow Estimation Shiyu Zhao, Long Zhao, Zhixing Zhang, Enyu Zhou, Dimitris Metaxas
LiT: Zero-Shot Transfer with Locked-image Text Tuning (see blog post) Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer
Are Multimodal Transformers Robust to Missing Modality? Mengmeng Ma, Jian Ren, Long Zhao, Davide Testuggine, Xi Peng
3D-VField: Adversarial Augmentation of Point Clouds for Domain Generalization in 3D Object Detection Alexander Lehner, Stefano Gasperini, Alvaro Marcos-Ramiro, Michael Schmidt, Mohammad-Ali Nikouei Mahani, Nassir Navab, Benjamin Busam, Federico Tombari
SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation Tao Sun, Mattia Segu, Janis Postels, Yuxuan Wang, Luc Van Gool, Bernt Schiele, Federico Tombari, Fisher Yu
H4D: Human 4D Modeling by Learning Neural Compositional Representation Boyan Jiang, Yinda Zhang, Xingkui Wei, Xiangyang Xue, Yanwei Fu
Gravitationally Lensed Black Hole Emission Tomography Aviad Levis, Pratul P. Srinivasan, Andrew A. Chael, Ren Ng, Katherine L. Bouman
Deep Saliency Prior for Reducing Visual Distraction Kfir Aberman, Junfeng He, Yossi Gandelsman, Inbar Mosseri, David E. Jacobs, Kai Kohlhoff, Yael Pritch, Michael Rubinstein
The Auto Arborist Dataset: A Large-Scale Benchmark for Multiview Urban Forest Monitoring Under Domain Shift Sara Beery, Guanhang Wu, Trevor Edwards, Filip Pavetic, Bo Majewski, Shreyasee Mukherjee, Stanley Chan, John Morgan, Vivek Rathod, Jonathan Huang
Workshops
Ethical Considerations in Creative Applications of Computer Vision Chairs and Advisors: Negar Rostamzadeh, Fernando Diaz, Emily Denton, Mark Diaz, Jason Baldridge
Dynamic Neural Networks Meet Computer Vision Organizers Invited Speaker: Barret Zoph
Precognition: Seeing Through the Future Organizer: Utsav Prabhu Invited Speaker: Sella Nevo
Computer Vision in the Built Environment for the Design, Construction, and Operation of Buildings Invited Speakers: Thomas Funkhouser, Federico Tombari
Neural Architecture Search: Lightweight NAS Challenge Invited Speaker: Barret Zoph
Transformers in Vision Organizer: Lucas Beyer Invited Speakers and Panelists: Alexander Kolesnikov, Mathilde Caron, Arsha Nagrani, Lucas Beyer
Challenge on Learned Image Compression Organizers: George Toderici, Johannes Balle, Eirikur Agustsson, Nick Johnston, Fabian Mentzer, Luca Versari Invited Speaker: Debargha Mukherjee
Embodied AI Organizers: Anthony Francis, Sören Pirk, Alex Ku, Fei Xia, Peter Anderson Scientific Advisory Board Members: Alexander Toshev, Jie Tan Invited Speaker: Carolina Parada
Sight and Sound Organizers: Arsha Nagrani, William Freeman
New Trends in Image Restoration and Enhancement Organizers: Ming-Hsuan Yang, Vivek Kwatra, George Toderici
EarthVision: Large Scale Computer Vision for Remote Sensing Imagery Invited Speaker: John Quinn
LatinX in Computer Vision Research Organizer: Ruben Villegas
Fine-Grained Visual Categorization Organizer: Kimberly Wilber
The Art of Robustness: Devil and Angel in Adversarial Machine Learning Organizer: Florian Tramèr Invited Speaker: Nicholas Carlini
AI for Content Creation Organizers: Deqing Sun, Huiwen Chang, Lu Jiang Invited Speaker: Chitwan Saharia
LOng-form VidEo Understanding Invited Speaker: Cordelia Schmid
Visual Perception and Learning in an Open World Invited Speaker: Rahul Sukthankar
Media Forensics Organizer : Christoph Bregler Technical Committee Members: Shruti Agarwal, Scott McCloskey, Peng Zhou
Vision Datasets Understanding Organizer: José Lezama
Embedded Vision Invited Speaker: Matthias Grundmann
Federated Learning for Computer Vision Invited Speaker: Zheng Xu
Large Scale Holistic Video Understanding Organizer: David Ross Invited Speaker: Anurag Arnab
Learning With Limited Labelled Data for Image and Video Understanding Invited Speaker: Hugo Larochelle
Bridging the Gap Between Computational Photography and Visual Recognition Invited Speaker: Xiaohua Zhai
Explainable Artificial Intelligence for Computer Vision Invited Speaker: Been Kim
Robustness in Sequential Data Organizers: Sayna Ebrahimi, Kevin Murphy Invited Speakers: Sayna Ebrahimi, Balaji Lakshminarayanan
Sketch-Oriented Deep Learning Organizer: David Ha Invited Speaker: Jonas Jongejan
Multimodal Learning and Applications Invited Speaker: Cordelia Schmid
Computational Cameras and Displays Organizer: Tali Dekel Invited Speaker: Peyman Millanfar
Artificial Social Intelligence Invited Speaker: Natasha Jaques
VizWiz Grand Challenge: Algorithms to Assist People Who Are Blind Invited Speaker and Panelist: Andrew Howard
Image Matching: Local Features & Beyond Organizer: Eduard Trulls
Multi-Agent Behavior: Representation, Modeling, Measurement, and Applications Organizer: Ting Liu
Efficient Deep Learning for Computer Vision Organizers: Pete Warden, Andrew Howard, Grace Chu, Jaeyoun Kim
Gaze Estimation and Prediction in the Wild Organizer: Thabo Beeler
Tutorials
Denoising Diffusion-Based Generative Modeling: Foundations and Applications Invited Speaker: Ruiqi Gao
Algorithmic Fairness: Why It's Hard and Why It's Interesting Invited Speaker: Sanmi Koyejo
Beyond Convolutional Neural Networks Invited Speakers: Neil Houlsby, Alexander Kolesnikov, Xiaohua Zhai
Joint Ego4D and Egocentric Perception, Interaction & Computing Invited Speaker: Vittorio Ferrari
Deep AUC Maximization Invited Speakers: Tianbao Yang
Vision-Based Robot Learning Organizers: Michael S. Ryoo, Andy Zeng, Pete Florence
Graph Machine Learning for Visual Computing Organizers: Federico Tombari Invited Speakers: Federico Tombari, Fabian Manhardt
Many recent advances in computer vision and robotics rely on deep learning, but training deep learning models requires a wide variety of data to generalize to new scenarios. Historically, deep learning for computer vision has relied on datasets with millions of items that were gathered by web scraping, examples of which include ImageNet, Open Images, YouTube-8M, and COCO. However, the process of creating these datasets can be labor-intensive, and can still exhibit labeling errors that can distort the perception of progress. Furthermore, this strategy does not readily generalize to arbitrary three-dimensional shapes or real-world robotic data.
Simulating robots and environments using tools such as Gazebo, MuJoCo, and Unity can mitigate many of the inherent limitations in these datasets. However, simulation is only an approximation of reality — handcrafted models built from polygons and primitives often correspond poorly to real objects. Even if a scene is built directly from a 3D scan of a real environment, the movable objects in that scan will act like fixed background scenery and will not respond the way real-world objects would. Due to these challenges, there are few large libraries with high-quality models of 3D objects that can be incorporated into physical and visual simulations to provide the variety needed for deep learning.
In “Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items”, presented at ICRA 2022, we describe our efforts to address this need by creating the Scanned Objects dataset, a curated collection of over 1000 3D-scanned common household items. The Scanned Objects dataset is usable in tools that read Simulation Description Format (SDF) models, including the Gazebo and PyBullet robotics simulators. Scanned Objects is hosted on Open Robotics, an open-source hosting environment for models compatible with the Gazebo simulator.
History Robotics researchers within Google began scanning objects in 2011, creating high-fidelity 3D models of common household objects to help robots recognize and grasp things in their environments. However, it became apparent that 3D models have many uses beyond object recognition and robotic grasping, including scene construction for physical simulations and 3D object visualization for end-user applications. Therefore, this Scanned Objects project was expanded to bring 3D experiences to Google at scale, collecting a large number of 3D scans of household objects through a process that is more efficient and cost effective than traditional commercial-grade product photography.
Scanned Objects was an end-to-end effort, involving innovations at nearly every stage of the process, including curation of objects at scale for 3D scanning, the development of novel 3D scanning hardware, efficient 3D scanning software, fast 3D rendering software for quality assurance, and specialized frontends for web and mobile viewers. We also executed human-computer interaction studies to create effective experiences for interacting with 3D objects.
These object models proved useful in 3D visualizations for Everyday Robots, which used the models to bridge the sim-to-real gap for training, work later published as RetinaGAN and RL-CycleGAN. Building on these earlier 3D scanning efforts, in 2019 we began preparing an external version of the Scanned Objects dataset and transforming the previous set of 3D images into graspable 3D models.
Object Scanning To create high-quality models, we built a scanning rig to capture images of an object from multiple directions under controlled and carefully calibrated conditions. The system consists of two machine vision cameras for shape detection, a DSLR camera for high-quality HDR color frame extraction, and a computer-controlled projector for pattern recognition. The scanning rig uses a structured light technique that infers a 3D shape from camera images with patterns of light that are projected onto an object.
Simulation Model Conversion The early internal scanned models used protocol buffer metadata, high-resolution visuals, and formats that were not suitable for simulation. For some objects, physical properties, such as mass, were captured by weighing the objects at scanning time, but surface properties, such as friction or deformation, were not represented.
So, following data collection, we built an automated pipeline to solve these issues and enable the use of scanned models in simulation systems. The automated pipeline filters out invalid or duplicate objects, automatically assigns object names using text descriptions of the objects, and eliminates object mesh scans that do not meet simulation requirements. Next, the pipeline estimates simulation properties (e.g., mass and moment of inertia) from shape and volume, constructs collision volumes, and downscales the model to a usable size. Finally, the pipeline converts each model to SDF format, creates thumbnail images, and packages the model for use in simulation systems.
The output of this pipeline is a simulation model in an appropriate format with a name, mass, friction, inertia, and collision information, along with searchable metadata in a public interface compatible with our open-source hosting on Open Robotics’ Gazebo.
The output objects are represented as SDF models that refer to Wavefront OBJ meshes averaging 1.4 Mb per model. Textures for these models are in PNG format and average 11.2 Mb. Together, these provide high resolution shape and texture.
Impact The Scanned Objects dataset contains 1030 scanned objects and their associated metadata, totaling 13 Gb, licensed under the CC-BY 4.0 License. Because these models are scanned rather than modeled by hand, they realistically reflect real object properties, not idealized recreations, reducing the difficulty of transferring learning from simulation to the real world.
The Scanned Objects dataset has already been used in over 25 papers across as many projects, spanning computer vision, computer graphics, robot manipulation, robot navigation, and 3D shape processing. Most projects used the dataset to provide synthetic training data for learning algorithms. For example, the Scanned Objects dataset was used in Kubric, an open-sourced generator of scalable datasets for use in over a dozen vision tasks, and in LAX-RAY, a system for searching shelves with lateral access X-rays to automate the mechanical search for occluded objects on shelves.
We hope that the Scanned Objects dataset will be used by more robotics and simulation researchers in the future, and that the example set by this dataset will inspire other owners of 3D model repositories to make them available for researchers everywhere. If you would like to try it yourself, head to Gazebo and start browsing!
Acknowledgments The authors thank the Scanned Objects team, including Peter Anderson-Sprecher, J.J. Blumenkranz, James Bruce, Ken Conley, Katie Dektar, Charles DuHadway, Anthony Francis, Chaitanya Gharpure, Topraj Gurung, Kristy Headley, Ryan Hickman, John Isidoro, Sumit Jain, Brandon Kinman, Greg Kline, Mach Kobayashi, Nate Koenig, Kai Kohlhoff, James Kuffner, Thor Lewis, Mike Licitra, Lexi Martin, Julian (Mac) Mason, Rus Maxham, Pascal Muetschard, Kannan Pashupathy, Barbara Petit, Arshan Poursohi, Jared Russell, Matt Seegmiller, John Sheu, Joe Taylor, Vincent Vanhoucke, Josh Weaver, and Tommy McHugh.
Special thanks go to Krista Reymann for organizing this project, helping write the paper, and editing this blogpost, James Bruce for the scanning pipeline design and Pascal Muetschard for maintaining the database of object models.