Patent documents typically use legal and highly technical language, with context-dependent terms that may have meanings quite different from colloquial usage and even between different documents. The process of using traditional patent search methods (e.g., keyword searching) to search through the corpus of over one hundred million patent documents can be tedious and result in many missed results due to the broad and non-standard language used. For example, a "soccer ball" may be described as a "spherical recreation device", "inflatable sportsball" or "ball for ball game". Additionally, the language used in some patent documents may obfuscate terms to their advantage, so more powerful natural language processing (NLP) and semantic similarity understanding can give everyone access to do a thorough search.
The patent domain (and more general technical literature like scientific publications) poses unique challenges for NLP modeling due to its use of legal and technical terms. While there are multiple commonly used general-purpose semantic textual similarity (STS) benchmark datasets (e.g., STS-B, SICK, MRPC, PIT), to the best of our knowledge, there are currently no datasets focused on technical concepts found in patents and scientific publications (the somewhat related BioASQ challenge contains a biomedical question answering task). Moreover, with the continuing growth in size of the patent corpus (millions of new patents are issued worldwide every year), there is a need to develop more useful NLP models for this domain.
Today, we announce the release of the Patent Phrase Similarity dataset, a new human-rated contextual phrase-to-phrase semantic matching dataset, and the accompanying paper, presented at the SIGIR PatentSemTech Workshop, which focuses on technical terms from patents. The Patent Phrase Similarity dataset contains ~50,000 rated phrase pairs, each with a Cooperative Patent Classification (CPC) class as context. In addition to similarity scores that are typically included in other benchmark datasets, we include granular rating classes similar to WordNet, such as synonym, antonym, hypernym, hyponym, holonym, meronym, and domain related. This dataset (distributed under the Creative Commons Attribution 4.0 International license) was used by Kaggle and USPTO as the benchmark dataset in the U.S. Patent Phrase to Phrase Matching competition to draw more attention to the performance of machine learning models on technical text. Initial results show that models fine-tuned on this new dataset perform substantially better than general pre-trained models without fine-tuning.
The Patent Phrase Similarity Dataset To better train the next generation of state-of-the-art models, we created the Patent Phrase Similarity dataset, which includes many examples to address the following problems: (1) phrase disambiguation, (2) adversarial keyword matching, and (3) hard negative keywords (i.e., keywords that are unrelated but received a high score for similarity from other models ). Some keywords and phrases can have multiple meanings (e.g., the phrase "mouse" may refer to an animal or a computer input device), so we disambiguate the phrases by including CPC classes with each pair of phrases. Also, many NLP models (e.g., bag of words models) will not do well on data with phrases that have matching keywords but are otherwise unrelated (adversarial keywords, e.g., “container section” → “kitchen container”, “offset table” → “table fan”). The Patent Phrase Similarity dataset is designed to include many examples of matching keywords that are unrelated through adversarial keyword match, enabling NLP models to improve their performance.
Each entry in the Patent Phrase Similarity dataset contains two phrases, an anchor and target, a context CPC class, a rating class, and a similarity score. The dataset contains 48,548 entries with 973 unique anchors, split into training (75%), validation (5%), and test (20%) sets. When splitting the data, all of the entries with the same anchor are kept together in the same set. There are 106 different context CPC classes and all of them are represented in the training set.
Generating the Dataset To generate the Patent Phrase Similarity data, we first process the ~140 million patent documents in the Google Patent's corpus and automatically extract important English phrases, which are typically noun phrases (e.g., “fastener”, “lifting assembly”) and functional phrases (e.g., “food processing”, “ink printing”). Next, we filter and keep phrases that appear in at least 100 patents and randomly sample around 1,000 of these filtered phrases, which we call anchor phrases. For each anchor phrase, we find all of the matching patents and all of the CPC classes for those patents. We then randomly sample up to four matching CPC classes, which become the context CPC classes for the specific anchor phrase.
We use two different methods for pre-generating target phrases: (1) partial matching and (2) a masked language model (MLM). For partial matching, we randomly select phrases from the entire corpus that partially match with the anchor phrase (e.g., “abatement” → “noise abatement”, “material formation” → “formation material”). For MLM, we select sentences from the patents that contain a given anchor phrase, mask them out, and use the Patent-BERT model to predict candidates for the masked portion of the text. Then, all of the phrases are cleaned up, which includes lowercasing and the removal of punctuation and certain stopwords (e.g., "and", "or", "said"), and sent to expert raters for review. Each phrase pair is rated independently by two raters skilled in the technology area. Each rater also generates new target phrases with different ratings. Specifically, they are asked to generate some low-similarity and unrelated targets that partially match with the original anchor and/or some high-similarity targets. Finally, the raters meet to discuss their ratings and come up with final ratings.
Dataset Evaluation To evaluate its performance, the Patent Phrase Similarity dataset was used in the U.S. Patent Phrase to Phrase Matching Kaggle competition. The competition was very popular, drawing about 2,000 competitors from around the world. A variety of approaches were successfully used by the top scoring teams, including ensemble models of BERT variants and prompting (see the full discussion for more details). The table below shows the best results from the competition, as well as several off-the-shelf baselines from our paper. The Pearson correlation metric was used to measure the linear correlation between the predicted and true scores, which is a helpful metric to target for downstream models so they can distinguish between different similarity ratings.
The baselines in the paper can be considered zero-shot in the sense that they use off-the-shelf models without any further fine-tuning on the new dataset (we use these models to embed the anchor and target phrases separately and compute the cosine similarity between them). The Kaggle competition results demonstrate that by using our training data, one can achieve significant improvements compared with existing NLP models. We have also estimated human performance on this task by comparing a single rater’s scores to the combined score of both raters. The results indicate that this is not a particularly easy task, even for human experts.
Conclusion and Future Work We present the Patent Phrase Similarity dataset, which was used as the benchmark dataset in the U.S. Patent Phrase to Phrase Matching competition, and demonstrate that by using our training data, one can achieve significant improvements compared with existing NLP models.
Additional challenging machine learning benchmarks can be generated from the patent corpus, and patent data has made its way into many of today's most-studied models. For example, the C4 text dataset used to train T5 contains many patent documents. The BigBird and LongT5 models also use patents via the BIGPATENT dataset. The availability, breadth and open usage terms of full text data (see Google Patents Public Datasets) makes patents a unique resource for the research community. Possibilities for future tasks include massively multi-label classification, summarization, information retrieval, image-text similarity, citation graph prediction, and translation. See the paper for more details.
Acknowledgements This work was possible through a collaboration with Kaggle, Satsyil Corp., USPTO, and MaxVal. Thanks to contributors Ian Wetherbee from Google, Will Cukierski and Maggie Demkin from Kaggle. Thanks to Jerry Ma, Scott Beliveau, and Jamie Holcombe from USPTO and Suja Chittamahalingam from MaxVal for their contributions.
In recent years video conferencing has played an increasingly important role in both work and personal communication for many users. Over the past two years, we have enhanced this experience in Google Meet by introducing privacy-preserving machine learning (ML) powered background features, also known as “virtual green screen”, which allows users to blur their backgrounds or replace them with other images. What is unique about this solution is that it runs directly in the browser without the need to install additional software.
So far, these ML-powered features have relied on CPU inference made possible by leveraging neural network sparsity, a common solution that works across devices, from entry level computers to high-end workstations. This enables our features to reach the widest audience. However, mid-tier and high-end devices often have powerful GPUs that remain untapped for ML inference, and existing functionality allows web browsers to access GPUs via shaders (WebGL).
With the latest update to Google Meet, we are now harnessing the power of GPUs to significantly improve the fidelity and performance of these background effects. As we detail in “Efficient Heterogeneous Video Segmentation at the Edge”, these advances are powered by two major components: 1) a novel real-time video segmentation model and 2) a new, highly efficient approach for in-browser ML acceleration using WebGL. We leverage this capability to develop fast ML inference via fragment shaders. This combination results in substantial gains in accuracy and latency, leading to crisper foreground boundaries.
Moving Towards Higher Quality Video Segmentation Models To predict finer details, our new segmentation model now operates on high definition (HD) input images, rather than lower-resolution images, effectively doubling the resolution over the previous model. To accommodate this, the model must be of higher capacity to extract features with sufficient detail. Roughly speaking, doubling the input resolution quadruples the computation cost during inference.
Inference of high-resolution models using the CPU is not feasible for many devices. The CPU may have a few high-performance cores that enable it to execute arbitrary complex code efficiently, but it is limited in its ability for the parallel computation required for HD segmentation. In contrast, GPUs have many, relatively low-performance cores coupled with a wide memory interface, making them uniquely suitable for high-resolution convolutional models. Therefore, for mid-tier and high-end devices, we adopt a significantly faster pure GPU pipeline, which is integrated using WebGL.
This change inspired us to revisit some of the prior design decisions for the model architecture.
In aggregate, these changes substantially improve the mean Intersection over Union (IoU) metric by 3%, resulting in less uncertainty and crisper boundaries around hair and fingers. We have also released the accompanying model card for this segmentation model, which details our fairness evaluations. Our analysis shows that the model is consistent in its performance across the various regions, skin-tones, and genders, with only small deviations in IoU metrics.
Accelerating Web ML with WebGL One common challenge for web-based inference is that web technologies can incur a performance penalty when compared to apps running natively on-device. For GPUs, this penalty is substantial, only achieving around 25% of native OpenGL performance. This is because WebGL, the current GPU standard for Web-based inference, was primarily designed for image rendering, not arbitrary ML workloads. In particular, WebGL does not include compute shaders, which allow for general purpose computation and enable ML workloads in mobile and native apps.
To overcome this challenge, we accelerated low-level neural network kernels with fragment shaders that typically compute the output properties of a pixel like color and depth, and then applied novel optimizations inspired by the graphics community. As ML workloads on GPUs are often bound by memory bandwidth rather than compute, we focused on rendering techniques that would improve the memory access, such as Multiple Render Targets (MRT).
MRT is a feature in modern GPUs that allows rendering images to multiple output textures (OpenGL objects that represent images) at once. While MRT was originally designed to support advanced graphics rendering such as deferred shading, we found that we could leverage this feature to drastically reduce the memory bandwidth usage of our fragment shader implementations for critical operations, like convolutions and fully connected layers. We do so by treating intermediate tensors as multiple OpenGL textures.
In the figure below, we show an example of intermediate tensors having four underlying GL textures each. With MRT, the number of GPU threads, and thus effectively the number of memory requests for weights, is reduced by a factor of four and saves memory bandwidth usage. Although this introduces considerable complexities in the code, it helps us reach over 90% of native OpenGL performance, closing the gap with native applications.
Conclusion We have made rapid strides in improving the quality of real-time segmentation models by leveraging the GPU on mid-tier and high-end devices for use with Google Meet. We look forward to the possibilities that will be enabled by upcoming technologies like WebGPU, which bring compute shaders to the web. Beyond GPU inference, we're also working on improving the segmentation quality for lower powered devices with quantized inference via XNNPACK WebAssembly.
Acknowledgements Special thanks to those on the Meet team and others who worked on this project, in particular Sebastian Jansson, Sami Kalliomäki, Rikard Lundmark, Stephan Reiter, Fabian Bergmark, Ben Wagner, Stefan Holmer, Dan Gunnarsson, Stéphane Hulaud, and to all our team members who made this possible: Siargey Pisarchyk, Raman Sarokin, Artsiom Ablavatski, Jamie Lin, Tyler Mullen, Gregory Karpiak, Andrei Kulik, Karthik Raveendran, Trent Tolley, and Matthias Grundmann.
The widespread availability of mobile phones has enabled non-profits to deliver critical health information to their beneficiaries in a timely manner. While advanced applications on smartphones allow for richer multimedia content and two-way communication between beneficiaries and health coaches, simpler text and voice messaging services can be effective in disseminating information to large communities, particularly those that are underserved with limited access to information and smartphones. ARMMAN1, one non-profit doing just this, is based in India with the mission of improving maternal and child health outcomes in underserved communities.
One of the programs run by them is mMitra, which employs automated voice messaging to deliver timely preventive care information to expecting and new mothers during pregnancy and until one year after birth. These messages are tailored according to the gestational age of the beneficiary. Regular listenership to these messages has been shown to have a high correlation with improved behavioral and health outcomes, such as a 17% increase in infants with tripled birth weight at end of year and a 36% increase in women knowing the importance of taking iron tablets.
However, a key challenge ARMMAN faced was that about 40% of women gradually stopped engaging with the program. While it’s possible to mitigate this with live service calls to women to explain the advantage of listening to the messages, it is infeasible to call all the low listeners in the program because of limited support staff — this highlights the importance of effectively prioritizing who receives such service calls.
In “Field Study in Deploying Restless Multi-Armed Bandits: Assisting Non-Profits in Improving Maternal and Child Health”, published in AAAI 2022, we describe an ML-based solution that uses historical data from the NGO to predict which beneficiaries will benefit most from service calls. We address the challenges that come with a large-scale real world deployment of such a system and show the usefulness of deploying this model in a real study involving over 23,000 participants. The model showed an increase in listenership of 30% compared to the current standard of care group.
Background We model this resource optimization problem using restless multi-armed bandits (RMABs), which have been well studied for application to such problems in a myriad of domains, including healthcare. An RMAB consists of n arms where each arm (representing a beneficiary) is associated with a two-state Markov decision process (MDP). Each MDP is modeled as a two-state (good or bad state, where the good state corresponds to high listenership in the previous week), two-action (corresponding to whether the beneficiary was chosen to receive a service call or not) problem. Further, each MDP has an associated reward function (i.e., the reward accumulated at a given state and action) and a transition function indicating the probability of moving from one state to the next under a given action, under the Markov condition that the next state depends only on the previous state and the action taken on that arm in that time step. The term restless indicates that all arms can change state irrespective of the action.
Model Development Finally, the RMAB problem is modeled such that at any time step, given n total arms, which k arms should be acted on (i.e., chosen to receive a service call), to maximize reward (engagement with the program).
The probability of transitioning from one state to another with (active probability) or without (passive probability) receiving a service call are therefore the underlying model parameters that are critical to solving the above optimization. To estimate these parameters, we use the demographic data of the beneficiaries collected at time of enrolment by the NGO, such as age, income, education, number of children, etc., as well as past listenership data, all in-line with the NGO’s data privacy standards (more below).
However, the limited volume of service calls limits the data corresponding to receiving a service call. To mitigate this, we use clustering techniques to learn from the collective observations of beneficiaries within a cluster and enable overcoming the challenge of limited samples per individual beneficiary.
In particular, we perform clustering on listenership behaviors, and then compute a mapping from the demographic features to each cluster.
This mapping is useful because when a new beneficiary is enrolled, we only have access to their demographic information and have no knowledge of their listenership patterns, since they haven’t had a chance to listen yet. Using the mapping, we can infer transition probabilities for any new beneficiary that enrolls into the system.
We used several qualitative and quantitative metrics to infer the optimal set of of clusters and explored different combinations of training data (demographic features only, features plus passive probabilities, features plus all probabilities, passive probabilities only) to achieve the most meaningful clusters, that are representative of the underlying data distribution and have a low variance in individual cluster sizes.
Clustering has the added advantage of reducing computational cost for resource-limited NGOs, as the optimization needs to be solved at a cluster level rather than an individual level. Finally, solving RMAB’s is known to be P-space hard, so we choose to solve the optimization using the popular Whittle index approach, which ultimately provides a ranking of beneficiaries based on their likely benefit of receiving a service call.
Results We evaluated the model in a real world study consisting of approximately 23,000 beneficiaries who were divided into three groups: the current standard of care (CSOC) group, the "round robin" (RR) group, and the RMAB group. The beneficiaries in the CSOC group follow the original standard of care, where there are no NGO initiated service calls. The RR group represents the scenario where the NGO often conducts service calls using some systematic set order — the idea here is to have an easily executable policy that services enough of a cross-section of beneficiaries and can be scaled up or down per week based on available resources (this is the approach used by the NGO in this particular case, but the approach may vary for different NGOs). The RMAB group receives service calls as predicted by the RMAB model. All the beneficiaries across the three groups continue to receive the automated voice messages independent of the service calls.
At the end of seven weeks, RMAB-based service calls resulted in the highest (and statistically significant) reduction in cumulative engagement drops (32%) compared to the CSOC group.
Ethical Considerations An ethics board at the NGO reviewed the study. We took significant measures to ensure participant consent is understood and recorded in a language of the community's choice at each stage of the program. Data stewardship resides in the hands of the NGO, and only the NGO is allowed to share data. The code will soon be available publicly. The pipeline only uses anonymized data and no personally identifiable information (PII) is made available to the models. Sensitive data, such as caste, religion, etc., are not collected by ARMMAN for mMitra. Therefore, in pursuit of ensuring fairness of the model, we worked with public health and field experts to ensure other indicators of socioeconomic status were measured and adequately evaluated as shown below.
The proportion of beneficiaries that received a live service call within each income bracket reasonably matches the proportion in the overall population. However, differences are observed in lower income categories, where the RMAB model favors beneficiaries with lower income and beneficiaries with no formal education. Lastly, domain experts at ARMMAN have been deeply involved in the development and testing of this system and have provided continuous input and oversight in data interpretation, data consumption, and model design.
Conclusions After thorough testing, the NGO has currently deployed this system for scheduling of service calls on a weekly basis. We are hopeful that this will pave the way for more deployments of ML algorithms for social impact in partnerships with non-profits in service of populations that have so far benefited less from ML. This work was also featured in Google for India 2021.
Acknowledgements This work is part of our AI for Social Good efforts and was led by Google Research, India. Thanks to all our collaborators at ARMMAN, Google Research India, Google.org, and University Relations: Aparna Hegde, Neha Madhiwalla, Suresh Chaudhary, Aditya Mate, Lovish Madaan, Shresth Verma, Gargi Singh, Divy Thakkar.
1ARMMAN runs multiple programs to provide preventive care information to women through pregnancy and infancy enabling them to seek care, as well as programs to train and support health workers for timely detection and management of high-risk conditions. ↩
Online video sharing platforms, like YouTube, need to understand perceptual video quality (i.e., a user's subjective perception of video quality) in order to better optimize and improve user experience. Video quality assessment (VQA) attempts to build a bridge between video signals and perceptual quality by using objective mathematical models to approximate the subjective opinions of users. Traditional video quality metrics, like peak signal-to-noise ratio (PSNR) and Video Multi-Method Assessment Fusion (VMAF), are reference-based and focus on the relative difference between the target and reference videos. Such metrics, which work best on professionally generated content (e.g., movies), assume the reference video is of pristine quality and that one can induce the target video's absolute quality from the relative difference.
However, the majority of the videos that are uploaded on YouTube are user-generated content (UGC), which bring new challenges due to their remarkably high variability in video content and original quality. Most UGC uploads are non-pristine and the same amount of relative difference could imply very different perceptual quality impacts. For example, people tend to be less sensitive to the distortions of poor quality uploads than of high quality uploads. Thus, reference-based quality scores become inaccurate and inconsistent when used for UGC cases. Additionally, despite the high volume of UGC, there are currently limited UGC video quality assessment (UGC-VQA) datasets with quality labels. Existing UGC-VQA datasets are either small in size (e.g., LIVE-Qualcomm has 208 samples captured from 54 unique scenes), compared with datasets with millions of samples for classification and recognition (e.g., ImageNet and YouTube-8M), or don’t have enough content variability (sampling without considering content information, like LIVE-VQC and KoNViD-1k).
In "Rich Features for Perceptual Quality Assessment of UGC Videos", published at CVPR 2021, we describe how we attempt to solve the UGC quality assessment problem by building a Universal Video Quality (UVQ) model that resembles a subjective quality assessment. The UVQ model uses subnetworks to analyze UGC quality from high-level semantic information to low-level pixel distortions, and provides a reliable quality score with rationale (leveraging comprehensive and interpretable quality labels). Moreover, to advance UGC-VQA and compression research, we enhance the open-sourced YouTube-UGC dataset, which contains 1.5K representative UGC samples from millions of UGC videos (distributed under the Creative Commons license) on YouTube. The updated dataset contains ground-truth labels for both original videos and corresponding transcoded versions, enabling us to better understand the relationship between video content and its perceptual quality.
Subjective Video Quality Assessment To understand perceptual video quality, we leverage an internal crowd-sourcing platform to collect mean opinion scores (MOS) with a scale of 1–5, where 1 is the lowest quality and 5 is the highest quality, for no-reference use cases. We collect ground-truth labels from the YouTube-UGC dataset and categorize UGC factors that affect quality perception into three high-level categories: (1) content, (2) distortions, and (3) compression. For example, a video with no meaningful content won't receive a high quality MOS. Also, distortions introduced during the video production phase and video compression artifacts introduced by third-party platforms, e.g., transcoding or transmission, will degrade the overall quality.
We demonstrate that the left gaming video in the second row of the figure above has the lowest MOS (1.2), even lower than the video with no meaningful content. A possible explanation is that viewers may have higher video quality expectations for videos that have a clear narrative structure, like gaming videos, and the blur artifacts significantly reduce the perceptual quality of the video.
UVQ Model Framework A common method for evaluating video quality is to design sophisticated features, and then map these features to a MOS. However, designing useful handcrafted features is difficult and time-consuming, even for domain experts. Also, the most useful existing handcrafted features were summarized from limited samples, which may not perform well on broader UGC cases. In contrast, machine learning is becoming more prominent in UGC-VQA because it can automatically learn features from large-scale samples.
A straightforward approach is to train a model from scratch on existing UGC quality datasets. However, this may not be feasible as there are limited quality UGC datasets. To overcome this limitation, we apply a self-supervised learning step to the UVQ model during training. This self-supervised step enables us to learn comprehensive quality-related features, without ground-truth MOS, from millions of raw videos.
Following the quality-related categories summarized from the subjective VQA, we develop the UVQ model with four novel subnetworks. The first three subnetworks, which we call ContentNet, DistortionNet and CompressionNet, are used to extract quality features (i.e., content, distortion and compression), and the fourth subnetwork, called AggregationNet, maps the extracted features to generate a single quality score. ContentNet is trained in a supervised learning fashion with UGC-specific content labels that are generated by the YouTube-8M model. DistortionNet is trained to detect common distortions, e.g., Gaussian blur and white noise of the original frame. CompressionNet focuses on video compression artifacts, whose training data are videos compressed with different bitrates. CompressionNet is trained using two compressed variants of the same content that are fed into the model to predict corresponding compression levels (with a higher score for more noticeable compression artifacts), with the implicit assumption that the higher bitrate version has a lower compression level.
The ContentNet, DistortionNet and CompressionNet subnetworks are trained on large-scale samples without ground-truth quality scores. Since video resolution is also an important quality factor, the resolution-sensitive subnetworks (CompressionNet and DistortionNet) are patch-based (i.e., each input frame is divided into multiple disjointed patches that are processed separately), which makes it possible to capture all detail on native resolution without downscaling. The three subnetworks extract quality features that are then concatenated by the fourth subnetwork, AggregationNet, to predict quality scores with domain ground-truth MOS from YouTube-UGC.
Analyzing Video Quality with UVQ After building the UVQ model, we use it to analyze the video quality of samples pulled from YouTube-UGC and demonstrate that its subnetworks can provide a single quality score along with high-level quality indicators that can help us understand quality issues. For example, DistortionNet detects multiple visual artifacts, e.g., jitter and lens blur, for the middle video below, and CompressionNet detects that the bottom video has been heavily compressed.
Additionally, UVQ can provide patch-based feedback to locate quality issues. Below, UVQ reports that the quality of the first patch (patch at time t = 1) is good with a low compression level. However, the model identifies heavy compression artifacts in the next patch (patch at time t = 2).
In practice, UVQ can generate a video diagnostic report that includes a content description (e.g., strategy video game), distortion analysis (e.g., the video is blurry or pixelated) and compression level (e.g., low or high compression). Below, UVQ reports that the content quality, looking at individual features, is good, but the compression and distortion quality is low. When combining all three features, the overall quality is medium-low. We see that these findings are close to the rationale summarized by internal user experts, demonstrating that UVQ can reason through quality assessments, while providing a single quality score.
Conclusion We present the UVQ model, which generates a report with quality scores and insights that can be used to interpret UGC video perceptual quality. UVQ learns comprehensive quality related features from millions of UGC videos and provides a consistent view of quality interpretation for both no-reference and reference cases. To learn more, read our paper or visit our website to see YT-UGC videos and their subjective quality data. We also hope that the enhanced YouTube-UGC dataset enables more research in this space.
Acknowledgements This work was possible through a collaboration spanning several Google teams. Key contributors include: Balu Adsumilli, Neil Birkbeck, Joong Gon Yim from YouTube and Junjie Ke, Hossein Talebi, Peyman Milanfar from Google Research. Thanks to Ross Wolf, Jayaprasanna Jayaraman, Carena Church, and Jessie Lin for their contributions.
One of the most important aspects in machine learning is hyperparameter optimization, as finding the right hyperparameters for a machine learning task can make or break a model’s performance. Internally, we regularly use Google Vizier as the default platform for hyperparameter optimization. Throughout its deployment over the last 5 years, Google Vizier has been used more than 10 million times, over a vast class of applications, including machine learning applications from vision, reinforcement learning, and language but also scientific applications such as protein discovery and hardware acceleration. As Google Vizier is able to keep track of use patterns in its database, such data, usually consisting of optimization trajectories termed studies, contain very valuable prior information on realistic hyperparameter tuning objectives, and are thus highly attractive for developing better algorithms.
While there have been many previous methods for meta-learning over such data, such methods share one major common drawback: their meta-learning procedures depend heavily on numerical constraints such as the number of hyperparameters and their value ranges, and thus require all tasks to use the exact same total hyperparameter search space (i.e., tuning specifications). Additional textual information in the study, such as its description and parameter names, are also rarely used, yet can hold meaningful information about the type of task being optimized. Such a drawback becomes more exacerbated for larger datasets, which often contain significant amounts of such meaningful information.
Today in “Towards Learning Universal Hyperparameter Optimizers with Transformers”, we are excited to introduce the OptFormer, one of the first Transformer-based frameworks for hyperparameter tuning, learned from large-scale optimization data using flexible text-based representations. While numerous works have previously demonstrated the Transformer’s strong abilities across various domains, few have touched on its optimization-based capabilities, especially over text space. Our core findings demonstrate for the first time some intriguing algorithmic abilities of Transformers: 1) a single Transformer network is capable of imitating highly complex behaviors from multiple algorithms over long horizons; 2) the network is further capable of predicting objective values very accurately, in many cases surpassing Gaussian Processes, which are commonly used in algorithms such as Bayesian Optimization.
Approach: Representing Studies as Tokens Rather than only using numerical data as common with previous methods, our novel approach instead utilizes concepts from natural language and represents all of the study data as a sequence of tokens, including textual information from initial metadata. In the animation below, this includes “CIFAR10”, “learning rate”, “optimizer type”, and “Accuracy”, which informs the OptFormer of an image classification task. The OptFormer then generates new hyperparameters to try on the task, predicts the task accuracy, and finally receives the true accuracy, which will be used to generate the next round’s hyperparameters. Using the T5X codebase, the OptFormer is trained in a typical encoder-decoder fashion using standard generative pretraining over a wide range of hyperparameter optimization objectives, including real world data collected by Google Vizier, as well as public hyperparameter (HPO-B) and blackbox optimization benchmarks (BBOB).
Imitating Policies As the OptFormer is trained over optimization trajectories by various algorithms, it may now accurately imitate such algorithms simultaneously. By providing a text-based prompt in the metadata for the designated algorithm (e.g. “Regularized Evolution”), the OptFormer will imitate the algorithm’s behavior.
Predicting Objective Values In addition, the OptFormer may now predict the objective value being optimized (e.g. accuracy) and provide uncertainty estimates. We compared the OptFormer’s prediction with a standard Gaussian Process and found that the OptFormer was able to make significantly more accurate predictions. This can be seen below qualitatively, where the OptFormer’s calibration curve closely follows the ideal diagonal line in a goodness-of-fit test, and quantitatively through standard aggregate metrics such as log predictive density.
Combining Both: Model-based Optimization We may now use the OptFormer’s function prediction capability to better guide our imitated policy, similar to techniques found in Bayesian Optimization. Using Thompson Sampling, we may rank our imitated policy’s suggestions and only select the best according to the function predictor. This produces an augmented policy capable of outperforming our industry-grade Bayesian Optimization algorithm in Google Vizier when optimizing classic synthetic benchmark objectives and tuning the learning rate hyperparameters of a standard CIFAR-10 training pipeline.
Conclusion Throughout this work, we discovered some useful and previously unknown optimization capabilities of the Transformer. In the future, we hope to pave the way for a universal hyperparameter and blackbox optimization interface to use both numerical and textual data to facilitate optimization over complex search spaces, and integrate the OptFormer with the rest of the Transformer ecosystem (e.g. language, vision, code) by leveraging Google’s vast collection of offline AutoML data.
Acknowledgements The following members of DeepMind and the Google Research Brain Team conducted this research: Yutian Chen, Xingyou Song, Chansoo Lee, Zi Wang, Qiuyi Zhang, David Dohan, Kazuya Kawakami, Greg Kochanski, Arnaud Doucet, Marc'aurelio Ranzato, Sagi Perel, and Nando de Freitas.
We would like to also thank Chris Dyer, Luke Metz, Kevin Murphy, Yannis Assael, Frank Hutter, and Esteban Real for providing valuable feedback, and further thank Sebastian Pineda Arango, Christof Angermueller, and Zachary Nado for technical discussions on benchmarks. In addition, we thank Daniel Golovin, Daiyi Peng, Yingjie Miao, Jack Parker-Holder, Jie Tan, Lucio Dery, and Aleksandra Faust for multiple useful conversations.
Finally, we thank Tom Small for designing the animation for this post.
Over the last several years, we have seen significant progress in applying machine learning to robotics. However, robotic systems today are capable of executing only very short, hard-coded commands, such as “Pick up an apple,” because they tend to perform best with clear tasks and rewards. They struggle with learning to perform long-horizon tasks and reasoning about abstract goals, such as a user prompt like “I just worked out, can you get me a healthy snack?”
Meanwhile, recent progress in training language models (LMs) has led to systems that can perform a wide range of language understanding and generation tasks with impressive results. However, these language models are inherently not grounded in the physical world due to the nature of their training process: a language model generally does not interact with its environment nor observe the outcome of its responses. This can result in it generating instructions that may be illogical, impractical or unsafe for a robot to complete in a physical context. For example, when prompted with “I spilled my drink, can you help?” the language model GPT-3 responds with “You could try using a vacuum cleaner,” a suggestion that may be unsafe or impossible for the robot to execute. When asking the FLAN language model the same question, it apologizes for the spill with "I'm sorry, I didn't mean to spill it,” which is not a very useful response. Therefore, we asked ourselves, is there an effective way to combine advanced language models with robot learning algorithms to leverage the benefits of both?
In “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances”, we present a novel approach, developed in partnership with Everyday Robots, that leverages advanced language model knowledge to enable a physical agent, such as a robot, to follow high-level textual instructions for physically-grounded tasks, while grounding the language model in tasks that are feasible within a specific real-world context. We evaluate our method, which we call PaLM-SayCan, by placing robots in a real kitchen setting and giving them tasks expressed in natural language. We observe highly interpretable results for temporally-extended complex and abstract tasks, like “I just worked out, please bring me a snack and a drink to recover.” Specifically, we demonstrate that grounding the language model in the real world nearly halves errors over non-grounded baselines. We are also excited to release a robot simulation setup where the research community can test this approach.
A Dialog Between User and Robot, Facilitated by the Language Model Our approach uses the knowledge contained in language models (Say) to determine and score actions that are useful towards high-level instructions. It also uses an affordance function (Can) that enables real-world-grounding and determines which actions are possible to execute in a given environment. Using the the PaLM language model, we call this PaLM-SayCan.
Our system can be seen as a dialog between the user and robot, facilitated by the language model. The user starts by giving an instruction that the language model turns into a sequence of steps for the robot to execute. This sequence is filtered using the robot’s skillset to determine the most feasible plan given its current state and environment. The model determines the probability of a specific skill successfully making progress toward completing the instruction by multiplying two probabilities: (1) task-grounding (i.e., a skill language description) and (2) world-grounding (i.e., skill feasibility in the current state).
There are additional benefits of our approach in terms of its safety and interpretability. First, by allowing the LM to score different options rather than generate the most likely output, we effectively constrain the LM to only output one of the pre-selected responses. In addition, the user can easily understand the decision making process by looking at the separate language and affordance scores, rather than a single output.
Training Policies and Value Functions Each skill in the agent’s skillset is defined as a policy with a short language description (e.g., “pick up the can”), represented as embeddings, and an affordance function that indicates the probability of completing the skill from the robot’s current state. To learn the affordance functions, we use sparse reward functions set to 1.0 for a successful execution, and 0.0 otherwise.
We use image-based behavioral cloning (BC) to train the language-conditioned policies and temporal-difference-based (TD) reinforcement learning (RL) to train the value functions. To train the policies, we collected data from 68,000 demos performed by 10 robots over 11 months and added 12,000 successful episodes, filtered from a set of autonomous episodes of learned policies. We then learned the language conditioned value functions using MT-Opt in the Everyday Robots simulator. The simulator complements our real robot fleet with a simulated version of the skills and environment, which is transformed using RetinaGAN to reduce the simulation-to-real gap. We bootstrapped simulation policies’ performance by using demonstrations to provide initial successes, and then continuously improved RL performance with online data collection in simulation.
Performance on Temporally-Extended, Complex, and Abstract Instructions To test our approach, we use robots from Everyday Robots paired with PaLM. We place the robots in a kitchen environment containing common objects and evaluate them on 101 instructions to test their performance across various robot and environment states, instruction language complexity and time horizon. Specifically, these instructions were designed to showcase the ambiguity and complexity of language rather than to provide simple, imperative queries, enabling queries such as “I just worked out, how would you bring me a snack and a drink to recover?” instead of “Can you bring me water and an apple?”
We use two metrics to evaluate the system’s performance: (1) the plan success rate, indicating whether the robot chose the right skills for the instruction, and (2) the execution success rate, indicating whether it performed the instruction successfully. We compare two language models, PaLM and FLAN (a smaller language model fine-tuned on instruction answering) with and without the affordance grounding as well as the underlying policies running directly with natural language (Behavioral Cloning in the table below). The results show that the system using PaLM with affordance grounding (PaLM-SayCan) chooses the correct sequence of skills 84% of the time and executes them successfully 74% of the time, reducing errors by 50% compared to FLAN and compared to PaLM without robotic grounding. This is particularly exciting because it represents the first time we can see how an improvement in language models translates to a similar improvement in robotics. This result indicates a potential future where robotics is able to ride the wave of progress that we have been observing in language models, bringing these subfields of research closer together.
If you're interested in learning more about this project from the researchers themselves, please check out the video below:
Conclusion and Future Work We’re excited about the progress that we’ve seen with PaLM-SayCan, an interpretable and general approach to leveraging knowledge from language models that enables a robot to follow high-level textual instructions to perform physically-grounded tasks. Our experiments on a number of real-world robotic tasks demonstrate the ability to plan and complete long-horizon, abstract, natural language instructions at a high success rate. We believe that PaLM-SayCan’s interpretability allows for safe real-world user interaction with robots. As we explore future directions for this work, we hope to better understand how information gained via the robot’s real-world experience could be leveraged to improve the language model and to what extent natural language is the right ontology for programming robots. We have open-sourced a robot simulation setup, which we hope will provide researchers with a valuable resource for future research that combines robotic learning with advanced language models. The research community can visit the project’s GitHub page and website to learn more.
Acknowledgements We’d like to thank our coauthors Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Kelly Fu, Keerthana Gopalakrishnan, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. We’d also like to thank Yunfei Bai, Matt Bennice, Maarten Bosma, Justin Boyd, Bill Byrne, Kendra Byrne, Noah Constant, Pete Florence, Laura Graesser, Rico Jonschkowski, Daniel Kappler, Hugo Larochelle, Benjamin Lee, Adrian Li, Maysam Moussalem, Suraj Nair, Krista Reymann, Jeff Seto, Dhruv Shah, Ian Storz, Razvan Surdulescu, and Vincent Zhao for their help and support in various aspects of the project. And we’d like to thank Tom Small for creating many of the animations in this post.