Advances in reinforcement learning (RL) for robotics have enabled robotic agents to perform increasingly complex tasks in challenging environments. Recent results show that robots can learn to fold clothes, dexterously manipulate a Rubik’s Cube, sort objects by color, navigate complex environments and walk on difficult, uneven terrain. But "short-horizon" tasks such as these, which require very little long-term planning and provide immediate failure feedback, are relatively easy to train for compared to many tasks that may confront a robot in a real-world setting. Unfortunately, scaling such short-horizon skills to the abstract, long horizons of real-world tasks is difficult. For example, how would one train a robot capable of picking up objects to rearrange a room?
Hierarchical reinforcement learning (HRL), a popular way of solving this problem, has achieved some success in a variety of long-horizon RL tasks. HRL aims to solve such problems by reasoning over a bank of low-level skills, thus providing an abstraction for actions. However, the high-level planning problem can be further simplified by abstracting both states and actions. For example, consider a tabletop rearrangement task, where a robot is tasked with interacting with objects on a desk. Using recent advances in RL, imitation learning, and unsupervised skill discovery, it is possible to obtain a set of primitive manipulation skills such as opening or closing drawers, picking or placing objects, etc. However, even for the simple task of putting a block into the drawer, chaining these skills together is not straightforward. This may be attributed to a combination of (i) challenges with planning and reasoning over long horizons, and (ii) dealing with high dimensional observations while parsing the semantics and affordances of the scene, i.e., where and when the skill can be used.
In “Value Function Spaces: Skill-Centric State Abstractions for Long-Horizon Reasoning”, presented at ICLR 2022, we address the task of learning suitable state and action abstractions for long-range problems. We posit that a minimal, but complete, representation for a higher-level policy in HRL must depend on the capabilities of the skills available to it. We present a simple mechanism to obtain such a representation using skill value functions and show that such an approach improves long-horizon performance in both model-based and model-free RL and enables better zero-shot generalization.
Building a Value Function Space The key insight motivating this work is that the abstract representation of actions and states is readily available from trained policies via their value functions. The notion of “value” in RL is intrinsically linked to affordances, in that the value of a state for a given skill reflects the probability of receiving a reward for successfully executing that skill. For any skill, its value function captures two key properties: 1) the preconditions and affordances of the scene, i.e., where and when the skill can be used, and 2) the outcome, which indicates whether the skill executed successfully when it was used.
Given a decision process with a finite set of k skills trained with sparse outcome rewards and their corresponding value functions, we construct an embedding space by stacking these skill value functions. This gives us an abstract representation that maps a state to a k-dimensional representation that we call the Value Function Space, or VFS for short. This representation captures functional information about the exhaustive set of interactions that the agent can have with the environment, and is thus a suitable state abstraction for downstream tasks.
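To make the construction concrete, the minimal sketch below stacks per-skill value estimates into a k-dimensional VFS vector. It assumes access to k trained skill value functions, which are stubbed out here as hypothetical callables rather than the learned networks used in practice.

```python
import numpy as np

def vfs_embedding(observation, skill_value_fns):
    """Map a raw observation to its Value Function Space representation.

    skill_value_fns is a list of k callables, one per skill, each returning the
    skill's value estimate (e.g., predicted success probability under a sparse
    outcome reward) for the given observation.
    """
    return np.array([v(observation) for v in skill_value_fns])

# Example with k = 3 hypothetical skills; real value functions would be
# learned networks rather than constants.
skills = [
    lambda obs: 0.9,   # "Pick blue cube"
    lambda obs: 0.1,   # "Open drawer"
    lambda obs: 0.8,   # "Place on counter"
]
print(vfs_embedding({"image": None}, skills))  # -> 3-dimensional VFS vector
```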
Consider a toy example of the tabletop rearrangement setup discussed earlier, with the task of placing the blue object in the drawer. There are eight elementary actions in this environment. The bar plot on the right shows the values of each skill at any given time, and the graph at the bottom shows the evolution of these values over the course of the task.
At the beginning, the values corresponding to the “Place on Counter” skill are high since the objects are already on the counter; likewise, the values corresponding to “Close Drawer” are high. Through the trajectory, when the robot picks up the blue cube, the corresponding skill value peaks. Similarly, the values corresponding to placing the objects in the drawer increase when the drawer is open and peak when the blue cube is placed inside it. All the functional information required to effect each transition and predict its outcome (success or failure) is captured by the VFS representation, and in principle, allows a high-level agent to reason over all the skills and chain them together, resulting in an effective representation of the observations.
Additionally, since VFS learns a skill-centric representation of the scene, it is robust to exogenous factors of variation, such as background distractors and appearances of task-irrelevant components of the scene. All configurations shown below are functionally equivalent — an open drawer with the blue cube in it, a red cube on the countertop, and an empty gripper — and can be interacted with identically, despite apparent differences.
Robotic Manipulation with VFS This approach enables VFS to plan out complex robotic manipulation tasks. Take, for example, a model-based reinforcement learning (MBRL) algorithm that uses a simple one-step predictive model of the transition dynamics in value function space and randomly samples candidate skill sequences, selecting and executing the best one in a manner similar to model-predictive control. Given a set of primitive pushing skills of the form “move Object A near Object B” and a high-level rearrangement task, we find that VFS can use MBRL to reliably find skill sequences that solve the high-level task.
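The sketch below illustrates this style of planner under stated assumptions: a learned one-step dynamics model in value function space (here a hypothetical callable `dynamics_model`), random sampling of candidate skill sequences, and a goal expressed as a target VFS vector. The exact model and scoring used in the paper are not reproduced here.

```python
import numpy as np

def plan_skill_sequence(vfs_state, dynamics_model, goal_vfs, num_skills,
                        horizon=4, num_candidates=512, rng=None):
    """Random-shooting planner over skill sequences in Value Function Space.

    dynamics_model(vfs, skill_id) -> predicted next VFS vector (a learned
    one-step model). Candidates are scored by how close their final predicted
    VFS vector lands to the goal VFS vector; only the first skill is executed,
    MPC-style, before re-planning.
    """
    rng = rng or np.random.default_rng(0)
    best_seq, best_score = None, -np.inf
    for _ in range(num_candidates):
        seq = rng.integers(num_skills, size=horizon)
        vfs = vfs_state
        for skill in seq:
            vfs = dynamics_model(vfs, int(skill))
        score = -np.linalg.norm(vfs - goal_vfs)   # closer to the goal is better
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq[0]   # execute the first skill, then re-plan
```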
To better understand the attributes of the environment captured by VFS, we sample the VFS-encoded observations from a large number of independent trajectories in the robotic manipulation task and project them into two dimensions using the t-SNE technique, which is useful for visualizing clusters in high-dimensional data. These t-SNE embeddings reveal interesting patterns identified and modeled by VFS. Looking at some of these clusters closely, we find that VFS can successfully capture information about the contents (objects) in the scene and affordances (e.g., a sponge can be manipulated when held by the robot’s gripper), while ignoring distractors like the relative positions of the objects on the table and the pose of the robotic arm. While these factors are certainly important to solve the task, the low-level primitives available to the robot abstract them away and hence make them functionally irrelevant to the high-level controller.
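As a rough illustration of this analysis step, the snippet below projects k-dimensional VFS embeddings into two dimensions with scikit-learn's t-SNE; the random data is a placeholder standing in for real VFS-encoded observations.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# vfs_data: array of shape (num_observations, k), each row a VFS embedding
# collected from independent trajectories (placeholder random data here).
vfs_data = np.random.rand(1000, 8)

embedding_2d = TSNE(n_components=2, perplexity=30, init="pca",
                    random_state=0).fit_transform(vfs_data)

plt.scatter(embedding_2d[:, 0], embedding_2d[:, 1], s=4)
plt.title("t-SNE projection of VFS-encoded observations")
plt.show()
```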
Conclusions and Connections to Future Work Value function spaces are representations built on value functions of underlying skills, enabling long-horizon reasoning and planning over skills. VFS is a compact representation that captures the affordances of the scene and task-relevant information while robustly ignoring distractors. Empirical experiments reveal that such a representation improves planning for model-based and model-free methods and enables zero-shot generalization. Going forward, this representation has the promise to continue improving along with the field of multitask reinforcement learning. The interpretability of VFS further enables integration into fields such as safe planning and grounding language models.
Acknowledgements We thank our co-authors Sergey Levine, Ted Xiao, Alex Toshev, Peng Xu and Yao Lu for their contributions to the paper and feedback on this blog post. We also thank Tom Small for creating the informative visualizations used in this blog post.
The 10th International Conference on Learning Representations (ICLR 2022) kicks off this week, bringing together researchers, entrepreneurs, engineers and students alike to discuss and explore the rapidly advancing field of deep learning. Entirely virtual this year, ICLR 2022 offers conference and workshop tracks that present some of the latest research in deep learning and its applications to areas ranging from computer vision, speech recognition and text understanding to robotics, computational biology, and more.
As a Platinum Sponsor of ICLR 2022 and Champion DEI Action Fund contributor, Google will have a robust presence with nearly 100 accepted publications and extensive participation on organizing committees and in workshops. If you have registered for ICLR 2022, we hope you’ll watch our talks and learn about the work done at Google to address complex problems that affect billions of people. Here you can learn more about the research we will be presenting as well as our general involvement at ICLR 2022 (those with Google affiliations in bold).
Senior Area Chairs: Includes: Been Kim, Dale Schuurmans, Sergey Levine
Area Chairs: Includes: Adam White, Aditya Menon, Aleksandra Faust, Amin Karbasi, Amir Globerson, Andrew Dai, Balaji Lakshminarayanan, Behnam Neyshabur, Ben Poole, Bhuwan Dhingra, Bo Dai, Boqing Gong, Cristian Sminchisescu, David Ha, David Woodruff, Denny Zhou, Dipanjan Das, Dumitru Erhan, Dustin Tran, Emma Strubell, Eunsol Choi, George Dahl, George Tucker, Hanie Sedghi, Heinrich Jiang, Hossein Mobahi, Hugo Larochelle, Izhak Shafran, Jasper Snoek, Jean-Philippe Vert, Jeffrey Pennington, Justin Gilmer, Karol Hausman, Kevin Swersky, Krzysztof Choromanski, Mario Lučić, Mathieu Blondel, Matt Kusner, Michael Ryoo, Ming-Hsuan Yang, Minmin Chen, Mirella Lapata, Mohammad Ghavamzadeh, Mohammad Norouzi, Naman Agarwal, Nicholas Carlini, Olivier Bachem, Piyush Rai, Prateek Jain, Quentin Berthet, Richard Nock, Rose Yu, Sewoong Oh, Silvio Lattanzi, Slav Petrov, Srinadh Bhojanapalli, Tim Salimans, Ting Chen, Tong Zhang, Vikas Sindhwani, Weiran Wang, William Cohen, Xiaoming Liu
Workflow Chairs: Includes: Yaguang Li
Diversity Equity & Inclusion Chairs: Includes: Rosanne Liu
Invited Talks Beyond Interpretability: Developing a Language to Shape Our Relationships with AI Google Speaker: Been Kim
Do You See What I See? Large-Scale Learning from Multimodal Videos Google Speaker: Cordelia Schmid
Publications Hyperparameter Tuning with Renyi Differential Privacy – 2022 Outstanding Paper Award Nicolas Papernot, Thomas Steinke
MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling Yusong Wu, Ethan Manilow, Yi Deng, Rigel Swavely, Kyle Kastner, Tim Cooijmans, Aaron Courville, Cheng-Zhi Anna Huang, Jesse Engel
The Information Geometry of Unsupervised Reinforcement Learning Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine
Learning Strides in Convolutional Neural Networks – 2022 Outstanding Paper Award Rachid Riad*, Olivier Teboul, David Grangier, Neil Zeghidour
Poisoning and Backdooring Contrastive Learning Nicholas Carlini, Andreas Terzis
Coordination Among Neural Modules Through a Shared Global Workspace Anirudh Goyal, Aniket Didolkar, Alex Lamb, Kartikeya Badola, Nan Rosemary Ke, Nasim Rahaman, Jonathan Binas, Charles Blundell, Michael Mozer, Yoshua Bengio
Fine-Tuned Language Models Are Zero-Shot Learners (see the blog post) Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le
Large Language Models Can Be Strong Differentially Private Learners Xuechen Li, Florian Tramèr, Percy Liang, Tatsunori Hashimoto
Progressive Distillation for Fast Sampling of Diffusion Models Tim Salimans, Jonathan Ho
Exploring the Limits of Large Scale Pre-training Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi
Scarf: Self-Supervised Contrastive Learning Using Random Feature Corruption Dara Bahri, Heinrich Jiang, Yi Tay, Donald Metzler
Scalable Sampling for Nonsymmetric Determinantal Point Processes Insu Han, Mike Gartrell, Jennifer Gillenwater, Elvis Dohmatob, Amin Karbasi
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations Xiangning Chen, Cho-Jui Hsieh, Boqing Gong
ViTGAN: Training GANs with Vision Transformers Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, Ce Liu
Generalized Decision Transformer for Offline Hindsight Information Matching Hiroki Furuta, Yutaka Matsuo, Shixiang Shane Gu
The MultiBERTs: BERT Reproductions for Robustness Analysis Thibault Sellam, Steve Yadlowsky, Ian Tenney, Jason Wei, Naomi Saphra, Alexander D’Amour, Tal Linzen, Jasmijn Bastings, Iulia Turc, Jacob Eisenstein, Dipanjan Das, Ellie Pavlick
Scaling Laws for Neural Machine Translation Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, Colin Cherry
Interpretable Unsupervised Diversity Denoising and Artefact Removal Mangal Prakash, Mauricio Delbracio, Peyman Milanfar, Florian Jug
Understanding Latent Correlation-Based Multiview Learning and Self-Supervision: An Identifiability Perspective Qi Lyu, Xiao Fu, Weiran Wang, Songtao Lu
Memorizing Transformers Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, Christian Szegedy
Churn Reduction via Distillation Heinrich Jiang, Harikrishna Narasimhan, Dara Bahri, Andrew Cotter, Afshin Rostamizadeh
DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization Aviral Kumar, Rishabh Agarwal, Tengyu Ma, Aaron Courville, George Tucker, Sergey Levine
Path Auxiliary Proposal for MCMC in Discrete Space Haoran Sun, Hanjun Dai, Wei Xia, Arun Ramamurthy
On the Relation Between Statistical Learning and Perceptual Distances Alexander Hepburn, Valero Laparra, Raul Santos-Rodriguez, Johannes Ballé, Jesús Malo
Possibility Before Utility: Learning And Using Hierarchical Affordances Robby Costales, Shariq Iqbal, Fei Sha
MT3: Multi-Task Multitrack Music Transcription Josh Gardner*, Ian Simon, Ethan Manilow*, Curtis Hawthorne, Jesse Engel
Bayesian Neural Network Priors Revisited Vincent Fortuin, Adrià Garriga-Alonso, Sebastian W. Ober, Florian Wenzel, Gunnar Rätsch, Richard E. Turner, Mark van der Wilk, Laurence Aitchison
GradMax: Growing Neural Networks using Gradient Information Utku Evci, Bart van Merrienboer, Thomas Unterthiner, Fabian Pedregosa, Max Vladymyrov
Scene Transformer: A Unified Architecture for Predicting Future Trajectories of Multiple Agents Jiquan Ngiam, Benjamin Caine, Vijay Vasudevan, Zhengdong Zhang, Hao-Tien Lewis Chiang, Jeffrey Ling, Rebecca Roelofs, Alex Bewley, Chenxi Liu, Ashish Venugopal, David Weiss, Ben Sapp, Zhifeng Chen, Jonathon Shlens
The Role of Pretrained Representations for the OOD Generalization of RL Agents Frederik Träuble, Andrea Dittadi, Manuel Wüthrich, Felix Widmaier, Peter Gehler, Ole Winther, Francesco Locatello, Olivier Bachem, Bernhard Schölkopf, Stefan Bauer
Autoregressive Diffusion Models Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans
The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks Rahim Entezari, Hanie Sedghi, Olga Saukh, Behnam Neyshabur
DISSECT: Disentangled Simultaneous Explanations via Concept Traversals Asma Ghandeharioun, Been Kim, Chun-Liang Li, Brendan Jou, Brian Eoff, Rosalind W. Picard
Anisotropic Random Feature Regression in High Dimensions Gabriel C. Mel, Jeffrey Pennington
Open-Vocabulary Object Detection via Vision and Language Knowledge Distillation Xiuye Gu, Tsung-Yi Lin*, Weicheng Kuo, Yin Cui
MCMC Should Mix: Learning Energy-Based Model with Flow-Based Backbone Erik Nijkamp*, Ruiqi Gao, Pavel Sountsov, Srinivas Vasudevan, Bo Pang, Song-Chun Zhu, Ying Nian Wu
Effect of Scale on Catastrophic Forgetting in Neural Networks Vinay Ramasesh, Aitor Lewkowycz, Ethan Dyer
Incremental False Negative Detection for Contrastive Learning Tsai-Shien Chen, Wei-Chih Hung, Hung-Yu Tseng, Shao-Yi Chien, Ming-Hsuan Yang
Towards Evaluating the Robustness of Neural Networks Learned by Transduction Jiefeng Chen, Xi Wu, Yang Guo, Yingyu Liang, Somesh Jha
What Do We Mean by Generalization in Federated Learning? Honglin Yuan*, Warren Morningstar, Lin Ning, Karan Singhal
ViDT: An Efficient and Effective Fully Transformer-Based Object Detector Hwanjun Song, Deqing Sun, Sanghyuk Chun, Varun Jampani, Dongyoon Han, Byeongho Heo, Wonjae Kim, Ming-Hsuan Yang
Measuring CLEVRness: Black-Box Testing of Visual Reasoning Models Spyridon Mouselinos, Henryk Michalewski, Mateusz Malinowski
Wisdom of Committees: An Overlooked Approach To Faster and More Accurate Models (see the blog post) Xiaofang Wang, Dan Kondratyuk, Eric Christiansen, Kris M. Kitani, Yair Alon (prev. Movshovitz-Attias), Elad Eban
Leveraging Unlabeled Data to Predict Out-of-Distribution Performance Saurabh Garg*, Sivaraman Balakrishnan, Zachary C. Lipton, Behnam Neyshabur, Hanie Sedghi
Data-Driven Offline Optimization for Architecting Hardware Accelerators (see the blog post) Aviral Kumar, Amir Yazdanbakhsh, Milad Hashemi, Kevin Swersky, Sergey Levine
Diurnal or Nocturnal? Federated Learning of Multi-branch Networks from Periodically Shifting Distributions Chen Zhu*, Zheng Xu, Mingqing Chen, Jakub Konecny, Andrew Hard, Tom Goldstein
Policy Gradients Incorporating the Future David Venuto, Elaine Lau, Doina Precup, Ofir Nachum
Discrete Representations Strengthen Vision Transformer Robustness Chengzhi Mao*, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, Irfan Essa
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision (see the blog post) Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao
Neural Stochastic Dual Dynamic Programming Hanjun Dai, Yuan Xue, Zia Syed, Dale Schuurmans, Bo Dai
PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions Zhaoqi Leng, Mingxing Tan, Chenxi Liu, Ekin Dogus Cubuk, Xiaojie Shi, Shuyang Cheng, Dragomir Anguelov
Information Prioritization Through Empowerment in Visual Model-Based RL Homanga Bharadhwaj*, Mohammad Babaeizadeh, Dumitru Erhan, Sergey Levine
Value Function Spaces: Skill-Centric State Abstractions for Long-Horizon Reasoning Dhruv Shah, Peng Xu, Yao Lu, Ted Xiao, Alexander Toshev, Sergey Levine, Brian Ichter
Understanding and Leveraging Overparameterization in Recursive Value Estimation Chenjun Xiao, Bo Dai, Jincheng Mei, Oscar Ramirez, Ramki Gummadi, Chris Harris, Dale Schuurmans
The Efficiency Misnomer Mostafa Dehghani, Anurag Arnab, Lucas Beyer, Ashish Vaswani, Yi Tay
On the Role of Population Heterogeneity in Emergent Communication Mathieu Rita, Florian Strub, Jean-Bastien Grill, Olivier Pietquin, Emmanuel Dupoux
No One Representation to Rule Them All: Overlapping Features of Training Methods Raphael Gontijo-Lopes, Yann Dauphin, Ekin D. Cubuk
Data Poisoning Won’t Save You From Facial Recognition Evani Radiya-Dixit, Sanghyun Hong, Nicholas Carlini, Florian Tramèr
AdaMatch: A Unified Approach to Semi-Supervised Learning and Domain Adaptation David Berthelot, Rebecca Roelofs, Kihyuk Sohn, Nicholas Carlini, Alex Kurakin
Maximum Entropy RL (Provably) Solves Some Robust RL Problems Benjamin Eysenbach, Sergey Levine
Auto-scaling Vision Transformers Without Training Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang, Denny Zhou
Optimizing Few-Step Diffusion Samplers by Gradient Descent Daniel Watson, William Chan, Jonathan Ho, Mohammad Norouzi
ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, Donald Metzler
Fortuitous Forgetting in Connectionist Networks Hattie Zhou, Ankit Vani, Hugo Larochelle, Aaron Courville
Evading Adversarial Example Detection Defenses with Orthogonal Projected Gradient Descent Oliver Bryniarski, Nabeel Hingun, Pedro Pachuca, Vincent Wang, Nicholas Carlini
Benchmarking the Spectrum of Agent Capabilities Danijar Hafner
Charformer: Fast Character Transformers via Gradient-Based Subword Tokenization Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler
Mention Memory: Incorporating Textual Knowledge into Transformers Through Entity Mention Attention Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Fei Sha, William Cohen
Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums Rui Pan, Haishan Ye, Tong Zhang
Scale Efficiently: Insights from Pre-training and Fine-Tuning Transformers Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler
Omni-Scale CNNs: A Simple and Effective Kernel Size Configuration for Time Series Classification Wensi Tang, Guodong Long, Lu Liu, Tianyi Zhou, Michael Blumenstein, Jing Jiang
Embedded-Model Flows: Combining the Inductive Biases of Model-Free Deep Learning and Explicit Probabilistic Modeling Gianluigi Silvestri, Emily Fertig, Dave Moore, Luca Ambrogioni
Post Hoc Explanations May be Ineffective for Detecting Unknown Spurious Correlation Julius Adebayo, Michael Muelly, Hal Abelson, Been Kim
Axiomatic Explanations for Visual Search, Retrieval, and Similarity Learning Mark Hamilton, Scott Lundberg, Stephanie Fu, Lei Zhang, William T. Freeman
Pix2seq: A Language Modeling Framework for Object Detection (see the blog post) Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey Hinton
Mirror Descent Policy Optimization Manan Tomar, Lior Shani, Yonathan Efroni, Mohammad Ghavamzadeh
CodeTrek: Flexible Modeling of Code Using an Extensible Relational Representation Pardis Pashakhanloo, Aaditya Naik, Yuepeng Wang, Hanjun Dai, Petros Maniatis, Mayur Naik
Conditional Object-Centric Learning From Video Thomas Kipf, Gamaleldin F. Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, Klaus Greff
A Loss Curvature Perspective on Training Instabilities of Deep Learning Models Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George E. Dahl, Zack Nado, Orhan Firat
Autonomous Reinforcement Learning: Formalism and Benchmarking Archit Sharma, Kelvin Xu, Nikhil Sardana, Abhishek Gupta, Karol Hausman, Sergey Levine, Chelsea Finn
TRAIL: Near-Optimal Imitation Learning with Suboptimal Data Mengjiao Yang, Sergey Levine, Ofir Nachum
Minimax Optimization With Smooth Algorithmic Adversaries Tanner Fiez, Lillian J. Ratliff, Chi Jin, Praneeth Netrapalli
Unsupervised Semantic Segmentation by Distilling Feature Correspondences Mark Hamilton, Zhoutong Zhang, Bharath Hariharan, Noah Snavely, William T. Freeman
InfinityGAN: Towards Infinite-Pixel Image Synthesis Chieh Hubert Lin, Hsin-Ying Lee, Yen-Chi Cheng, Sergey Tulyakov, Ming-Hsuan Yang
Shuffle Private Stochastic Convex Optimization Albert Cheu, Matthew Joseph, Jieming Mao, Binghui Peng
Hybrid Random Features Krzysztof Choromanski, Haoxian Chen, Han Lin, Yuanzhe Ma, Arijit Sehanobish, Deepali Jain, Michael S Ryoo, Jake Varley, Andy Zeng, Valerii Likhosherstov, Dmitry Kalashnikov, Vikas Sindhwani, Adrian Weller
Vector-Quantized Image Modeling With Improved VQGAN Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu
On the Benefits of Maximum Likelihood Estimation for Regression and Forecasting Pranjal Awasthi, Abhimanyu Das, Rajat Sen, Ananda Theertha Suresh
Surrogate Gap Minimization Improves Sharpness-Aware Training Juntang Zhuang*, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha C. Dvornek, Sekhar Tatikonda, James S. Duncan, Ting Liu
Online Target Q-learning With Reverse Experience Replay: Efficiently Finding the Optimal Policy for Linear MDPs Naman Agarwal, Prateek Jain, Dheeraj Nagaraj, Praneeth Netrapalli, Syomantak Chaudhuri
CrossBeam: Learning to Search in Bottom-Up Program Synthesis Kensen Shi, Hanjun Dai, Kevin Ellis, Charles Sutton
Workshops Workshop on the Elements of Reasoning: Objects, Structure, and Causality (OSC) Organizers include: Klaus Greff, Thomas Kipf
Workshop on Agent Learning in Open-Endedness Organizers include: Krishna Srinivasan Speakers include: Natasha Jaques, Danijar Hafner
Wiki-M3L: Wikipedia and Multi-modal & Multi-lingual Research Organizers include: Klaus Greff, Thomas Kipf Speakers include: Jason Baldridge, Tom Duerig
Setting Up ML Evaluation Standards to Accelerate Progress Organizers include: Rishabh Agarwal Speakers and Panelists include: Katherine Heller, Sara Hooker, Corinna Cortes
From Cells to Societies: Collective Learning Across Scales Organizers include: Mark Sandler, Max Vladymyrov Speakers include: Blaise Aguera y Arcas, Alexander Mordvintsev, Michael Mozer
Emergent Communication: New Frontiers Speakers include: Natasha Jaques
Deep Learning for Code Organizers include: Jonathan Herzig
GroundedML: Anchoring Machine Learning in Classical Algorithmic Theory Speakers include: Gintare Karolina Dziugaite
Generalizable Policy Learning in the Physical World Speakers and Panelists include: Mrinal Kalakrishnan
CoSubmitting Summer (CSS) Workshop Organizers include: Rosanne Liu
Object detection is a long-standing computer vision task that attempts to recognize and localize all objects of interest in an image. The complexity arises when trying to identify or localize all object instances while also avoiding duplication. Existing approaches, like Faster R-CNN and DETR, are carefully designed and highly customized in the choice of architecture and loss function. This specialization of existing systems has created two major barriers: (1) it adds complexity in tuning and training the different parts of the system (e.g., region proposal network, graph matching with GIOU loss, etc.), and (2) it can reduce the ability of a model to generalize, necessitating a redesign of the model for application to other tasks.
In “Pix2Seq: A Language Modeling Framework for Object Detection”, published at ICLR 2022, we present a simple and generic method that tackles object detection from a completely different perspective. Unlike existing approaches that are task-specific, we cast object detection as a language modeling task conditioned on the observed pixel inputs. We demonstrate that Pix2Seq achieves competitive results on the large-scale object detection COCO dataset compared to existing highly-specialized and well-optimized detection algorithms, and its performance can be further improved by pre-training the model on a larger object detection dataset. To encourage further research in this direction, we are also excited to release to the broader research community Pix2Seq’s code and pre-trained models along with an interactive demo.
Pix2Seq Overview Our approach is based on the intuition that if a neural network knows where and what the objects in an image are, one could simply teach it how to read them out. By learning to “describe” objects, the model can learn to ground the descriptions on pixel observations, leading to useful object representations. Given an image, the Pix2Seq model outputs a sequence of object descriptions, where each object is described using five discrete tokens: the coordinates of the bounding box’s corners [ymin, xmin, ymax, xmax] and a class label.
With Pix2Seq, we propose a quantization and serialization scheme that converts bounding boxes and class labels into sequences of discrete tokens (similar to captions), and leverage an encoder-decoder architecture to perceive pixel inputs and generate the sequence of object descriptions. The training objective function is simply the maximum likelihood of tokens conditioned on pixel inputs and the preceding tokens.
Sequence Construction from Object Descriptions In commonly used object detection datasets, images have variable numbers of objects, represented as sets of bounding boxes and class labels. In Pix2Seq, a single object, defined by a bounding box and class label, is represented as [ymin, xmin, ymax, xmax, class]. However, typical language models are designed to process discrete tokens (or integers) and are unable to comprehend continuous numbers. So, instead of representing image coordinates as continuous numbers, we normalize the coordinates between 0 and 1 and quantize them into one of a few hundred or thousand discrete bins. The coordinates are thus converted into discrete tokens, as are the class labels, yielding object descriptions (similar to image captions) that the language model can interpret. The quantization process is achieved by multiplying the normalized coordinate (e.g., ymin) by the number of bins minus one, and rounding it to the nearest integer (the detailed process can be found in our paper).
After quantization, the object annotations provided with each training image are ordered into a sequence of discrete tokens (shown below). Since the order of the objects does not matter for the detection task per se, we randomize the order of objects each time an image is shown during training. We also append an End of Sequence (EOS) token at the end as different images often have different numbers of objects, and hence sequence lengths.
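A minimal sketch of this sequence construction is shown below; the bin count, class-token offset, and EOS token id are illustrative assumptions rather than the exact vocabulary used by Pix2Seq.

```python
import random

NUM_BINS = 1000                        # coordinate bins (a few hundred to a few thousand)
NUM_CLASSES = 91                       # assumed label-space size (illustrative)
CLASS_OFFSET = NUM_BINS                # class tokens placed after the coordinate bins
EOS_TOKEN = CLASS_OFFSET + NUM_CLASSES # hypothetical end-of-sequence token id

def quantize(coord, num_bins=NUM_BINS):
    """Map a coordinate normalized to [0, 1] onto a discrete bin index."""
    return round(coord * (num_bins - 1))

def build_sequence(objects, image_h, image_w):
    """Convert (ymin, xmin, ymax, xmax, class_id) boxes, given in pixels, into a
    flat token sequence with randomized object order and a trailing EOS token."""
    objects = list(objects)
    random.shuffle(objects)            # object order is irrelevant to the detection task
    tokens = []
    for ymin, xmin, ymax, xmax, class_id in objects:
        tokens += [quantize(ymin / image_h), quantize(xmin / image_w),
                   quantize(ymax / image_h), quantize(xmax / image_w),
                   CLASS_OFFSET + class_id]
    return tokens + [EOS_TOKEN]

# A 640x480 image with two annotated objects.
seq = build_sequence([(50, 60, 200, 220, 17), (10, 10, 100, 120, 3)], 480, 640)
print(seq)
```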
The Model Architecture, Objective Function, and Inference We treat the sequences that we constructed from object descriptions as a “dialect” and address the problem via a powerful and general language model with an image encoder and an autoregressive language decoder. Similar to language modeling, Pix2Seq is trained to predict tokens, given an image and preceding tokens, with a maximum likelihood loss. At inference time, we sample tokens from the model likelihood. The sampled sequence ends when the EOS token is generated. Once the sequence is generated, we split it into chunks of 5 tokens for extracting and de-quantizing the object descriptions (i.e., obtaining the predicted bounding boxes and class labels). It is worth noting that both the architecture and loss function are task-agnostic in that they don’t assume prior knowledge about object detection (e.g., bounding boxes). We describe how we can incorporate task-specific prior knowledge with a sequence augmentation technique in our paper.
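The inverse step at inference time might look like the sketch below, which reuses the hypothetical vocabulary constants from the construction sketch above to split a sampled sequence into chunks of 5 tokens and de-quantize them back into boxes.

```python
def decode_sequence(tokens, image_h, image_w, num_bins=NUM_BINS):
    """Invert build_sequence: recover (ymin, xmin, ymax, xmax, class_id) boxes in
    pixels from a sampled token sequence, stopping at the EOS token."""
    if EOS_TOKEN in tokens:
        tokens = tokens[:tokens.index(EOS_TOKEN)]
    boxes = []
    for i in range(0, len(tokens) - len(tokens) % 5, 5):   # chunks of 5 tokens
        ymin, xmin, ymax, xmax, cls = tokens[i:i + 5]
        boxes.append((ymin / (num_bins - 1) * image_h,
                      xmin / (num_bins - 1) * image_w,
                      ymax / (num_bins - 1) * image_h,
                      xmax / (num_bins - 1) * image_w,
                      cls - CLASS_OFFSET))
    return boxes

print(decode_sequence(seq, 480, 640))   # approximately recovers the boxes above
```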
Results Despite its simplicity, Pix2Seq achieves impressive empirical performance on benchmark datasets. Specifically, we compare our method with well established baselines, Faster R-CNN and DETR, on the widely used COCO dataset and demonstrate that it achieves competitive average precision (AP) results.
Since our approach incorporates minimal inductive bias or prior knowledge of the object detection task into the model design, we further explore how pre-training the model on a larger-scale object detection dataset can impact its performance. Our results indicate that this training strategy (along with using bigger models) can further boost performance.
Pix2Seq can detect objects in densely populated and complex scenes, such as those shown below.
Conclusion and Future Work With Pix2Seq, we cast object detection as a language modeling task conditioned on pixel inputs for which the model architecture and loss function are generic, and have not been engineered specifically for the detection task. One can, therefore, readily extend this framework to different domains or applications, where the output of the system can be represented by a relatively concise sequence of discrete tokens (e.g., keypoint detection, image captioning, visual question answering), or incorporate it into a perceptual system supporting general intelligence, for which it provides a language interface to a wide range of vision and language tasks. We also hope that the release of the Pix2Seq code, pre-trained models and interactive demo will inspire further research in this direction.
Acknowledgements This post reflects the combined work with our co-authors: Saurabh Saxena, Lala Li, Geoffrey Hinton. We would also like to thank Tom Small for the visualization of the Pix2Seq illustration figure.
As consumer electronics and internet-connected appliances are becoming more common, homes are beginning to embrace various types of connected devices that offer functionality like music control, voice assistance, and home automation. A graceful integration of devices requires adaptation to existing aesthetics and user styles rather than simply adding screens, which can easily disrupt a visual space, especially when they become monolithic surfaces or black screens when powered down or not actively used. Thus there is an increasing desire to create connected ambient computing devices and appliances that can preserve the aesthetics of everyday materials, while providing on-demand access to interaction and digital displays.
In “Hidden Interfaces for Ambient Computing: Enabling Interaction in Everyday Materials through High-Brightness Visuals on Low-Cost Matrix Displays”, presented at ACM CHI 2022, we describe an interface technology that is designed to be embedded underneath materials and our vision of how such technology can co-exist with everyday materials and aesthetics. This technology makes it possible to have high-brightness, low-cost displays appear from underneath materials such as textile, wood veneer, acrylic or one-way mirrors, for on-demand touch-based interaction.
Parallel Rendering: Boosting PMOLED Brightness for Ambient Computing While many of today’s consumer devices employ active-matrix organic light-emitting diode (AMOLED) displays, their cost and manufacturing complexity are prohibitive for ambient computing. Yet other display technologies, such as E-ink and LCD, do not have sufficient brightness to penetrate materials.
To address this gap, we explore the potential of passive-matrix OLEDs (PMOLEDs), which are based on a simple design that significantly reduces cost and complexity. However, PMOLEDs typically use scanline rendering, where active display driver circuitry sequentially activates one row at a time, a process that limits display brightness and introduces flicker.
Instead, we propose a system that uses parallel rendering, where as many rows as possible are activated simultaneously in each operation by grouping rectilinear shapes of horizontal and vertical lines. For example, a square can be shown with just two operations, in contrast to traditional scanline rendering that needs as many operations as there are rows. With fewer operations, parallel rendering can output significantly more light in each instant to boost brightness and eliminate flicker. The technique is not strictly limited to lines and rectangles even if that is where we see the most dramatic performance increase. For example, one could add additional rendering steps for antialiasing (i.e., smoothing of) non-rectilinear content.
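The toy model below illustrates the operation-count difference on a binary frame buffer; it is a simplification (rows that light the same set of columns are grouped into one operation), not the production rendering pipeline.

```python
import numpy as np

def scanline_ops(frame):
    """Scanline rendering: one operation per row that contains any lit pixel."""
    return int(np.count_nonzero(frame.any(axis=1)))

def parallel_ops(frame):
    """Parallel rendering (toy model): rows that light the same set of columns
    can be driven together in one operation, since activating a group of rows
    and a group of columns lights their full cross product on a passive matrix."""
    return len({tuple(row) for row in frame.astype(int) if row.any()})

# Hollow 40x40 square on a 96x128 frame buffer.
frame = np.zeros((96, 128), dtype=bool)
frame[20, 20:60] = frame[59, 20:60] = True    # top and bottom edges
frame[21:59, 20] = frame[21:59, 59] = True    # left and right edges
print(scanline_ops(frame))   # 40 scanline operations
print(parallel_ops(frame))   # 2 parallel operations
```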
Rendering User Interfaces and Text We show that hidden interfaces can be used to create dynamic and expressive interactions. With a set of fundamental UI elements such as buttons, switches, sliders, and cursors, each interface can provide different basic controls, such as light switches, volume controls and thermostats. We created a scalable font (i.e., a set of numbers and letters) that is designed for efficient rendering in just a few operations. While we currently exclude letters “k, z, x” with their diagonal lines, they could be supported with additional operations. The per-frame-control of font properties coupled with the high frame rate of the display enables very fluid animations — this capability greatly expands the expressivity of the rectilinear graphics far beyond what is possible on fixed 7-segment LED displays.
In this work, we demonstrate various examples, such as a scalable clock, a caller ID display, a zooming countdown timer, and a music visualizer.
Realizing Hidden Interfaces with Interactive Hardware To implement proof-of-concept hidden interfaces, we use a PMOLED display with 128×96 resolution that has all row and column drivers routed to a connector for direct access. We use a custom printed circuit board (PCB) with fourteen 16-channel digital-to-analog converters (DACs) to directly interface those 224 lines from a Raspberry Pi 3 A+. The touch interaction is enabled by a ring-shaped PCB surrounding the display with 12 electrodes arranged in arc segments.
Comparison to Existing Technologies We compared the brightness of our parallel rendering to both the scanline on the same PMOLED and a small and large state-of-the-art AMOLED. We tested brightness through six common materials, such as wood and plastic. The material thickness ranged from 0.2 mm for the one-way mirror film to 1.6 mm for basswood. We measured brightness in lux (lx = light intensity as perceived by the human eye) using a light meter near the display. The environmental light was kept dim, slightly above the light meter’s minimum sensitivity. For simple rectangular shapes, we observed 5–40x brightness increase for the PMOLED in comparison to the AMOLED. The exception was the thick basswood, which didn’t let much light through for any rendering technology.
To validate the findings from our technical characterization with more realistic and complex content, we evaluate the number “2”, a grid of checkboxes, three progress bars, and the text “Good Life”. For this more complex content, we observed a 3.6–9.3x brightness improvement. These results suggest that our approach of parallel rendering on PMOLED enables display through several materials, and outperforms common state-of-the-art AMOLED displays, which seem to not be usable for the tested scenarios.
What's Next? In this work, we enabled hidden interfaces that can be embedded in traditional materials and appear on demand. Our lab evaluation suggests unmet opportunities to introduce hidden displays with simple, yet expressive, dynamic and interactive UI elements and text in traditional materials, especially wood and mirror, to blend into people’s homes.
In the future, we hope to investigate more advanced parallel rendering techniques, using algorithms that could also support images and complex vector graphics. Furthermore, we plan to explore efficient hardware designs. For example, application-specific integrated circuits (ASICs) could enable an inexpensive and small display controller with parallel rendering instead of a large array of DACs. Finally, longitudinal deployment would enable us to go deeper into understanding user adoption and behavior with hidden interfaces.
Hidden interfaces demonstrate how control and feedback surfaces of smart devices and appliances could visually disappear when not in use and then appear when in the user's proximity or touch. We hope this direction will encourage the community to consider other approaches and scenarios where technology can fade into the background for a more harmonious coexistence with traditional materials and human environments.
Acknowledgements First and foremost, we would like to thank Ali Rahimi and Roman Lewkow for the collaboration, including providing the enabling technology. We also thank Olivier Bau, Aaron Soloway, Mayur Panchal and Sukhraj Hothi for their prototyping and fabrication contributions. We thank Michelle Chang and Mark Zarich for visual designs, illustrations and presentation support. We thank Google ATAP and the Google Interaction Lab for their support of the project. Finally, we thank Sarah Sterman and Mathieu Le Goc for helpful discussions and suggestions.
Form-based document understanding is a growing research topic because of its practical potential for automatically converting unstructured text data into structured information to gain insight about a document’s contents. Recent sequence models built on self-attention, a mechanism that directly models relationships between all words in a selection of text, have demonstrated state-of-the-art performance on natural language tasks. A natural approach to handle form document understanding tasks is to first serialize the form documents (usually in a left-to-right, top-to-bottom fashion) and then apply state-of-the-art sequence models to them.
However, form documents often have more complex layouts that contain structured objects, such as tables, columns, and text blocks. Their variety of layout patterns makes serialization difficult, substantially limiting the performance of strict serialization approaches. These unique challenges in form document structural modeling have been largely underexplored in literature.
In “FormNet: Structural Encoding Beyond Sequential Modeling in Form Document Information Extraction”, presented at ACL 2022, we propose a structure-aware sequence model, called FormNet, to mitigate the sub-optimal serialization of forms for document information extraction. First, we design a Rich Attention (RichAtt) mechanism that leverages the 2D spatial relationship between word tokens for more accurate attention weight calculation. Then, we construct Super-Tokens (tokens that aggregate semantically meaningful information from neighboring tokens) for each word by embedding representations from their neighboring tokens through a graph convolutional network (GCN). Finally, we demonstrate that FormNet outperforms existing methods, while using less pre-training data, and achieves state-of-the-art performance on the CORD, FUNSD, and Payment benchmarks.
Extended Transformer Construction (ETC) We adopt ETC as the FormNet model backbone. ETC scales to relatively long inputs by replacing standard attention, which has quadratic complexity, with a sparse global-local attention mechanism that distinguishes between global and long input tokens. The global tokens attend to and are attended by all tokens, but the long tokens attend only locally to other long tokens within a specified local radius, reducing the complexity so that it is more manageable for long sequences.
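A simple way to picture this scheme is as a boolean attention mask; the sketch below builds such a mask for a handful of global and long tokens with a small local radius (the token counts and radius are arbitrary, chosen only for illustration).

```python
import numpy as np

def global_local_mask(num_global, num_long, local_radius):
    """Boolean attention mask for a sparse global-local scheme: global tokens
    attend to (and are attended by) everything, while long tokens attend only
    to other long tokens within `local_radius` positions, plus the globals."""
    n = num_global + num_long
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_global, :] = True          # global rows attend everywhere
    mask[:, :num_global] = True          # every token attends to the global tokens
    for i in range(num_long):
        lo = max(0, i - local_radius)
        hi = min(num_long, i + local_radius + 1)
        mask[num_global + i, num_global + lo:num_global + hi] = True
    return mask

print(global_local_mask(num_global=2, num_long=6, local_radius=1).astype(int))
```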
Rich Attention Our novel attention mechanism, RichAtt, avoids the deficiencies of absolute and relative embeddings by avoiding embeddings entirely. Instead, it computes the order of and log distance between pairs of tokens with respect to the x and y axes on the layout grid, and adjusts the pre-softmax attention scores of each pair as a direct function of these values.
In a traditional attention layer, each token representation is linearly transformed into a Query vector, a Key vector, and a Value vector. A token “looks” for other tokens from which it might want to absorb information (i.e., attend to) by finding the ones with Key vectors that create relatively high scores when matrix-multiplied (called Matmul) by its Query vector and then softmax-normalized. The token then sums together the Value vectors of all other tokens in the sentence, weighted by their score, and passes this up the network, where it will normally be added to the token’s original input vector.
However, other features beyond the Query and Key vectors are often relevant to the decision of how strongly a token should attend to another given token, such as the order they’re in, how many other tokens separate them, or how many pixels apart they are. In order to incorporate these features into the system, we use a trainable parametric function paired with an error network, which takes the observed feature and the output of the parametric function and returns a penalty that reduces the dot product attention score.
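The schematic sketch below conveys the flavor of this idea for a single observed feature (pairwise log distance along one layout axis): dot-product scores are penalized by the squared gap between the observed value and an "ideal" value predicted from the query vector. The specific parametric and error functions derived in the paper are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rich_attention_scores(q, k, log_dist, w_ideal, b_ideal, strength):
    """Schematic RichAtt-style scoring: start from dot-product scores, predict an
    'ideal' log distance for each query-key pair from the query vector (a stand-in
    for the trainable parametric function), and subtract a penalty that grows with
    the gap between the observed and ideal values (a stand-in for the error network)."""
    scores = q @ k.T                                        # standard attention scores
    ideal = q @ w_ideal + b_ideal                           # (num_q,) ideal log distance
    penalty = strength * (log_dist - ideal[:, None]) ** 2   # observed vs. ideal gap
    return softmax(scores - penalty, axis=-1)

# Toy shapes: 4 query tokens, 4 key tokens, 8-dim heads; log_dist is the observed
# pairwise log distance along one layout axis.
rng = np.random.default_rng(0)
q, k = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
log_dist = np.log1p(np.abs(np.arange(4)[:, None] - np.arange(4)[None, :]))
attn = rich_attention_scores(q, k, log_dist, w_ideal=rng.normal(size=8),
                             b_ideal=0.0, strength=1.0)
print(attn.round(2))
```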
At a high level, for each attention head at each layer, FormNet examines each pair of token representations, determines the ideal features the tokens should have if there is a meaningful relationship between them, and penalizes the attention score according to how different the actual features are from the ideal ones. This allows the model to learn constraints on attention using logical implication.
Furthermore, if one assumes that the softmax-normalized attention scores represent a probability distribution, and the distributions for the observed features are known, then this algorithm — including the exact choice of parametric functions and error functions — falls out algebraically, meaning FormNet has a mathematical correctness to it that is lacking from many alternatives (including relative embeddings).
Super-Tokens by Graph Learning The key to sparsifying attention mechanisms in ETC for long sequence modeling is to have every token only attend to tokens that are nearby in the serialized sequence. Although the RichAtt mechanism empowers the transformer by taking the spatial layout structure into account, poor serialization can still block significant attention weight calculation between related word tokens.
To further mitigate the issue, we construct a graph to connect nearby tokens in a form document. We design the edges of the graph based on strong inductive biases so that they have higher probabilities of belonging to the same entity type. For each token, we obtain its Super-Token embedding by applying graph convolutions along these edges to aggregate semantically relevant information from neighboring tokens. We then use these Super-Tokens as an input to the RichAtt ETC architecture. This means that even though an entity may get broken up into multiple segments due to poor serialization, the Super-Tokens learned by the GCN will have retained much of the context of the entity phrase.
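A minimal, NumPy-only stand-in for this aggregation step is sketched below: a single graph-convolution layer that averages each token's embedding with those of its layout neighbors and applies a learned linear map. The edge construction and GCN details used by FormNet differ from this toy version.

```python
import numpy as np

def super_tokens(token_embeddings, edges, weight):
    """One graph-convolution step: each token's Super-Token embedding averages its
    own embedding with those of its layout neighbors, then applies a learned linear
    map followed by a ReLU (a simplified stand-in for the GCN used in FormNet)."""
    n = token_embeddings.shape[0]
    adj = np.eye(n)                                   # self-loops keep each token's own info
    for i, j in edges:                                # edges connect nearby tokens in the form
        adj[i, j] = adj[j, i] = 1.0
    adj = adj / adj.sum(axis=1, keepdims=True)        # row-normalized aggregation
    return np.maximum(adj @ token_embeddings @ weight, 0.0)

# Toy example: 5 tokens with 16-dim embeddings, edges from a layout heuristic.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16))
edges = [(0, 1), (1, 2), (3, 4)]                      # hypothetical nearby-token pairs
st = super_tokens(tokens, edges, weight=rng.normal(size=(16, 16)))
print(st.shape)   # (5, 16) Super-Token embeddings fed to the RichAtt ETC model
```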
Key Results The Figure below shows model size vs. F1 score (the harmonic mean of the precision and recall) for recent approaches on the CORD benchmark. FormNet-A2 outperforms the most recent DocFormer while using a model that is 2.5x smaller. FormNet-A3 achieves state-of-the-art performance with a 97.28% F1 score. For more experimental results, please refer to the paper.
We study the importance of RichAtt and Super-Token by GCN on the large-scale masked language modeling (MLM) pre-training task across three FormNets. Both RichAtt and GCN components improve upon the ETC baseline on reconstructing the masked tokens by a large margin, showing the effectiveness of their structural encoding capability on form documents. The best performance is obtained when incorporating both RichAtt and GCN.
Using BertViz, we visualize the local-to-local attention scores for specific examples from the CORD dataset for the standard ETC and FormNet models. Qualitatively, we confirm that the tokens attend primarily to other tokens within the same visual block for FormNet. Moreover, for that model, specific attention heads attend to tokens that are aligned horizontally, which is a strong signal of meaning for form documents. No clear attention pattern emerges for the ETC model, suggesting that RichAtt and the Super-Tokens by GCN enable the model to learn structural cues and leverage layout information effectively.
Conclusion We present FormNet, a novel model architecture for form-based document understanding. We determine that the novel RichAtt mechanism and Super-Token components help the ETC transformer excel at form understanding in spite of sub-optimal, noisy serialization. We demonstrate that FormNet recovers local syntactic information that may have been lost during text serialization and achieves state-of-the-art performance on three benchmarks.
Acknowledgements This research was conducted by Chen-Yu Lee, Chun-Liang Li, Timothy Dozat, Vincent Perot, Guolong Su, Nan Hua, Joshua Ainslie, Renshen Wang, Yasuhisa Fujii, and Tomas Pfister. Thanks to Evan Huang, Shengyang Dai, and Salem Elie Haykal for their valuable feedback, and Tom Small for creating the animation in this post.
Supervised learning is a common approach to machine learning (ML) in which the model is trained using data that is labeled appropriately for the task at hand. Ordinary supervised learning trains on independent and identically distributed (IID) data, where all training examples are sampled from a fixed set of classes, and the model has access to these examples throughout the entire training phase. In contrast, continual learning tackles the problem of training a single model on changing data distributions where different classification tasks are presented sequentially. This is particularly important, for example, to enable autonomous agents to process and interpret continuous streams of information in real-world scenarios.
To illustrate the difference between supervised and continual learning, consider two tasks: (1) classify cats vs. dogs and (2) classify pandas vs. koalas. In supervised learning, which uses IID data, the model is given training data from both tasks and treats it as a single 4-class classification problem. However, in continual learning, these two tasks arrive sequentially, and the model only has access to the training data of the current task. As a result, such models tend to suffer from performance degradation on the previous tasks, a phenomenon called catastrophic forgetting.
Mainstream solutions try to address catastrophic forgetting by buffering past data in a “rehearsal buffer” and mixing it with current data to train the model. However, the performance of these solutions depends heavily on the size of the buffer and, in some cases, may not be possible at all due to data privacy concerns. Another branch of work designs task-specific components to avoid interference between tasks. But these methods often assume that the task at test time is known, which is not always true, and they require a large number of parameters. The limitations of these approaches raise critical questions for continual learning: (1) Is it possible to have a more effective and compact memory system that goes beyond buffering past data? (2) Can one automatically select relevant knowledge components for an arbitrary sample without knowing its task identity?
In “Learning to Prompt for Continual Learning”, presented at CVPR 2022, we attempt to answer these questions. Drawing inspiration from prompting techniques in natural language processing, we propose a novel continual learning framework called Learning to Prompt (L2P). Instead of continually re-learning all the model weights for each sequential task, we provide learnable task-relevant “instructions” (i.e., prompts) to guide pre-trained backbone models through sequential training via a pool of learnable prompt parameters. L2P is applicable to various challenging continual learning settings and outperforms previous state-of-the-art methods consistently on all benchmarks. It achieves competitive results against rehearsal-based methods while also being more memory efficient. Most importantly, L2P is the first to introduce the idea of prompting in the field of continual learning.
Prompt Pool and Instance-Wise Query Given a pre-trained Transformer model, “prompt-based learning” modifies the original input using a fixed template. Imagine a sentiment analysis task is given the input “I like this cat”. A prompt-based method will transform the input to “I like this cat. It looks X”, where the “X” is an empty slot to be predicted (e.g., “nice”, “cute”, etc.) and “It looks X” is the so-called prompt. By adding prompts to the input, one can condition the pre-trained models to solve many downstream tasks. While designing fixed prompts requires prior knowledge along with trial and error, prompt tuning prepends a set of learnable prompts to the input embedding to instruct the pre-trained backbone to learn a single downstream task, under the transfer learning setting.
In the continual learning scenario, L2P maintains a learnable prompt pool, where prompts can be flexibly grouped as subsets to work jointly. Specifically, each prompt is associated with a key that is learned by minimizing a cosine similarity loss between the key and the matched input query features. These keys are then utilized by a query function to dynamically look up a subset of task-relevant prompts based on the input features. At test time, inputs are mapped by the query function to the top-N closest keys in the prompt pool, and the associated prompt embeddings are then fed to the rest of the model to generate the output prediction. At training time, we optimize the prompt pool and the classification head via the cross-entropy loss.
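The sketch below illustrates the lookup mechanism just described under simplifying assumptions (a frozen-backbone query feature, NumPy in place of a deep learning framework, and arbitrary pool sizes); it shows the cosine-similarity scoring and top-N selection, not the full training loop.

```python
import numpy as np

def select_prompts(query_feature, prompt_keys, prompt_pool, top_n=5):
    """Instance-wise prompt lookup: score every learnable key in the pool by cosine
    similarity with the input's query feature (e.g., a frozen-backbone [CLS]
    embedding), pick the top-N, and return the associated prompts, which would be
    prepended to the input token embeddings."""
    q = query_feature / np.linalg.norm(query_feature)
    k = prompt_keys / np.linalg.norm(prompt_keys, axis=1, keepdims=True)
    sims = k @ q                              # cosine similarity per key
    idx = np.argsort(-sims)[:top_n]           # top-N closest keys
    return prompt_pool[idx], idx

# Toy pool: 10 prompts, each 4 tokens of dimension 32; query dim matches the keys.
rng = np.random.default_rng(0)
pool = rng.normal(size=(10, 4, 32))           # learnable prompt parameters
keys = rng.normal(size=(10, 32))              # learnable keys, one per prompt
prompts, chosen = select_prompts(rng.normal(size=32), keys, pool)
print(prompts.shape, chosen)                  # (5, 4, 32) prompts and their indices
```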
Intuitively, similar input examples tend to choose similar sets of prompts and vice versa. Thus, prompts that are frequently shared encode more generic knowledge while other prompts encode more task-specific knowledge. Moreover, prompts store high-level instructions and keep lower-level pre-trained representations frozen, thus catastrophic forgetting is mitigated even without the necessity of a rehearsal buffer. The instance-wise query mechanism removes the necessity of knowing the task identity or boundaries, enabling this approach to address the under-investigated challenge of task-agnostic continual learning.
Effectiveness of L2P We evaluate the effectiveness of L2P against different baseline methods using an ImageNet pre-trained Vision Transformer (ViT) on representative benchmarks. The naïve baseline, called Sequential in the graphs below, refers to training a single model sequentially on all tasks. The EWC model adds a regularization term to mitigate forgetting and the Rehearsal model saves past examples to a buffer for mixed training with current data. To measure the overall continual learning performance, we measure both the accuracy and the average difference between the best accuracy achieved during training and the final accuracy for all tasks (except the last task), which we call forgetting. We find that L2P outperforms the Sequential and EWC methods significantly in both metrics. Notably, L2P even surpasses the Rehearsal approach, which uses an additional buffer to save past data. Because the L2P approach is orthogonal to Rehearsal, its performance could be further improved if it, too, used a rehearsal buffer.
We also visualize the prompt selection result from our instance-wise query strategy on two different benchmarks, where one has similar tasks and the other has varied tasks. The results indicate that L2P promotes more knowledge sharing between similar tasks by having more shared prompts, and less knowledge sharing between varied tasks by having more task-specific prompts.
Conclusion In this work, we present L2P to address key challenges in continual learning from a new perspective. L2P does not require a rehearsal buffer or known task identity at test time to achieve high performance. Further, it can handle various complex continual learning scenarios, including the challenging task-agnostic setting. Because large-scale pre-trained models are widely used in the machine learning community for their robust performance on real-world problems, we believe that L2P opens a new learning paradigm towards practical continual learning applications.
Acknowledgements We gratefully acknowledge the contributions of other co-authors, including Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, Tomas Pfister. We would also like to thank Chun-Liang Li, Jeremy Martin Kubica, Sayna Ebrahimi, Stratis Ioannidis, Nan Hua, and Emmanouil Koukoumidis, for their valuable discussions and feedback, and Tom Small for figure creation.