While model design and training data are key ingredients in a deep neural network’s (DNN’s) success, less often discussed is the specific optimization method used for updating the model parameters (weights). Training DNNs involves minimizing a loss function that measures the discrepancy between the ground truth labels and the model’s predictions. Training is carried out by backpropagation, which adjusts the model weights via gradient descent steps. Gradient descent, in turn, updates the weights by using the gradient (i.e., derivative) of the loss with respect to the weights.
The simplest weight update corresponds to stochastic gradient descent, which, in every step, moves the weights in the direction of the negative gradient, scaled by an appropriate step size (a.k.a. the learning rate). More advanced optimization methods modify the direction of the negative gradient before updating the weights by using information from the past steps and/or the local properties (such as the curvature) of the loss function around the current weights. For instance, a momentum optimizer encourages moving along the average direction of past updates, and the AdaGrad optimizer scales each coordinate based on the past gradients. These optimizers are commonly known as first-order methods since they generally modify the update direction using only information from the first-order derivative (i.e., the gradient). Importantly, they treat each component of the weight parameters independently of the others.
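As a concrete illustration, here are minimal sketches of these three update rules; the hyperparameter values and function names are ours, chosen for readability rather than taken from any particular library.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Vanilla SGD: step against the gradient, scaled by the learning rate.
    return w - lr * grad

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # Momentum: accumulate a decaying average of past update directions.
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adagrad_step(w, grad, accum, lr=0.01, eps=1e-8):
    # AdaGrad: scale each coordinate by the history of its squared gradients.
    accum = accum + grad ** 2
    return w - lr * grad / (np.sqrt(accum) + eps), accum
```

Note that in each rule every coordinate of the weights is updated independently of the others.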
More advanced optimization methods, such as Shampoo and K-FAC, capture the correlations between gradients of parameters and have been shown to improve convergence, reducing the number of iterations and improving the quality of the solution. These methods capture information about the local changes of the derivatives of the loss, i.e., changes in gradients. Using this additional information, higher-order optimizers can discover much more efficient update directions for training models by taking into account the correlations between different groups of parameters. On the downside, calculating higher-order update directions is computationally more expensive than first-order updates: it requires more memory for storing statistics and involves matrix inversion, thus hindering the applicability of higher-order optimizers in practice.
In “LocoProp: Enhancing BackProp via Local Loss Optimization”, we introduce a new framework for training DNN models. Our new framework, LocoProp, conceives neural networks as a modular composition of layers. Generally, each layer in a neural network applies a linear transformation on its inputs, followed by a non-linear activation function. In the new construction, each layer is allotted its own weight regularizer, output target, and loss function. The loss function of each layer is designed to match the activation function of the layer. Using this formulation, training minimizes the local losses for a given mini-batch of examples, iteratively and in parallel across layers. Our method performs multiple local updates per batch of examples using a first-order optimizer (like RMSProp), which avoids computationally expensive operations such as the matrix inversions required for higher-order optimizers. However, we show that the combined local updates look rather like a higher-order update. Empirically, we show that LocoProp outperforms first-order methods on a deep autoencoder benchmark and performs comparably to higher-order optimizers, such as Shampoo and K-FAC, without the high memory and computation requirements.
Method
Neural networks are generally viewed as composite functions that transform model inputs into output representations, layer by layer. LocoProp adopts this view while decomposing the network into layers. In particular, instead of updating the weights of the layer to minimize the loss function at the output, LocoProp applies pre-defined local loss functions specific to each layer. For a given layer, the loss function is selected to match the activation function, e.g., a tanh loss would be selected for a layer with a tanh activation. Each layerwise loss measures the discrepancy between the layer's output (for a given mini-batch of examples) and a notion of a target output for that layer. Additionally, a regularizer term ensures that the updated weights do not drift too far from the current values. The combined layerwise loss function (with a local target) plus regularizer is used as the new objective function for each layer.
Perhaps the simplest loss function one can think of for a layer is the squared loss. While the squared loss is a valid choice of a loss function, LocoProp takes into account the possible non-linearity of the activation functions of the layers and applies layerwise losses tailored to the activation function of each layer. This enables the model to emphasize regions of the input that are more important for the model prediction while deemphasizing the regions that do not affect the output as much. Below we show examples of tailored losses for the tanh and ReLU activation functions.
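One standard way to pair a loss with an activation f is the "matching loss", whose gradient with respect to the layer's pre-activation is exactly f(pre-activation) − target. The sketches below follow that construction for tanh and ReLU; they are our illustration of the idea and may differ in detail from the losses used in the paper.

```python
import numpy as np

# Matching losses: the gradient w.r.t. the pre-activation a is f(a) - y.
# (Numerically naive versions, for illustration only.)

def tanh_matching_loss(a, y):
    # d/da [log cosh(a) - a*y] = tanh(a) - y
    return np.sum(np.log(np.cosh(a)) - a * y)

def relu_matching_loss(a, y):
    # d/da [0.5 * relu(a)^2 - a*y] = relu(a) - y
    return np.sum(0.5 * np.maximum(a, 0.0) ** 2 - a * y)
```

With the identity activation, the same construction recovers the familiar squared loss (up to a constant).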
After forming the objective in each layer, LocoProp updates the layer weights by repeatedly applying gradient descent steps on its objective. The update typically uses a first-order optimizer (like RMSProp). However, we show that the overall behavior of the combined updates closely resembles higher-order updates (shown below). Thus, LocoProp provides training performance close to what higher-order optimizers achieve without the high memory or computation needed for higher-order methods, such as matrix inverse operations. We show that LocoProp is a flexible framework that allows the recovery of well-known algorithms and enables the construction of new algorithms via different choices of losses, targets, and regularizers. LocoProp’s layerwise view of neural networks also allows updating the weights in parallel across layers.
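A minimal sketch of this inner loop for one linear layer, assuming plain gradient steps (the paper uses RMSProp), a squared-distance regularizer that anchors the weights to their current values, and the matching-loss gradients from above; all names and constants are ours.

```python
import numpy as np

def locoprop_layer_update(W, x, target, local_loss_grad, lr=0.05, lam=10.0, steps=5):
    """Approximately minimize local_loss(W @ x, target) + lam * ||W - W0||^2."""
    W0 = W.copy()  # anchor: the regularizer keeps W close to its current value
    for _ in range(steps):
        a = W @ x                               # layer pre-activation
        g = local_loss_grad(a, target)          # e.g., np.tanh(a) - target
        grad_W = np.outer(g, x) + 2.0 * lam * (W - W0)
        W = W - lr * grad_W                     # first-order step
    return W
```

For a tanh layer, local_loss_grad would simply be lambda a, y: np.tanh(a) - y. Because each layer minimizes only its own objective, these loops can run independently, which is what makes the updates parallelizable across layers.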
Experiments
In our paper, we describe experiments on the deep autoencoder model, which is a commonly used baseline for evaluating the performance of optimization algorithms. We perform extensive tuning on multiple commonly used first-order optimizers, including SGD, SGD with momentum, AdaGrad, RMSProp, and Adam, as well as the higher-order Shampoo and K-FAC optimizers, and compare the results with LocoProp. Our findings indicate that LocoProp performs significantly better than first-order optimizers and is comparable to higher-order optimizers, while being significantly faster when run on a single GPU.
Summary and Future Directions
We introduced a new framework, called LocoProp, for optimizing deep neural networks more efficiently. LocoProp decomposes neural networks into separate layers with their own regularizer, output target, and loss function and applies local updates in parallel to minimize the local objectives. While using first-order updates for the local optimization problems, the combined updates closely resemble higher-order update directions, both theoretically and empirically.
LocoProp provides flexibility to choose the layerwise regularizers, targets, and loss functions. Thus, it allows the development of new update rules based on these choices. Our code for LocoProp is available online on GitHub. We are currently working on scaling up ideas inspired by LocoProp to much larger scale models; stay tuned!
Acknowledgments
We would like to thank our co-author, Manfred K. Warmuth, for his critical contributions and inspiring vision. We would like to thank Sameer Agarwal for discussions looking at this work from a composite functions perspective, Vineet Gupta for discussions and development of Shampoo, Zachary Nado for discussions on K-FAC, Tom Small for development of the animation used in this blogpost, and finally, Yonghui Wu and Zoubin Ghahramani for providing us with a nurturing research environment in the Google Brain Team.
In natural conversations, we don't say people's names every time we speak to each other. Instead, we rely on contextual signaling mechanisms to initiate conversations, and eye contact is often all it takes. Google Assistant, now available in more than 95 countries and over 29 languages, has primarily relied on a hotword mechanism ("Hey Google" or “OK Google”) to help more than 700 million people every month get things done across Assistant devices. As virtual assistants become an integral part of our everyday lives, we're developing ways to initiate conversations more naturally.
At Google I/O 2022, we announced Look and Talk, a major development in our journey to create natural and intuitive ways to interact with Google Assistant-powered home devices. This is the first multimodal, on-device Assistant feature that simultaneously analyzes audio, video, and text to determine when you are speaking to your Nest Hub Max. Using eight machine learning models together, the algorithm can differentiate intentional interactions from passing glances in order to accurately identify a user's intent to engage with Assistant. Once within 5ft of the device, the user may simply look at the screen and talk to start interacting with the Assistant.
We developed Look and Talk in alignment with our AI Principles. It meets our strict audio and video processing requirements, and like our other camera sensing features, video never leaves the device. You can always stop, review and delete your Assistant activity at myactivity.google.com. These added layers of protection enable Look and Talk to work just for those who turn it on, while keeping your data safe.
Modeling Challenges
The journey of this feature began as a technical prototype built on top of models developed for academic research. Deployment at scale, however, required solving a number of real-world challenges unique to this feature.
The evolution of the algorithm involved experiments with approaches ranging from domain adaptation and personalization to domain-specific dataset development, field-testing and feedback, and repeated tuning of the overall algorithm.
Technology Overview
A Look and Talk interaction has three phases. In the first phase, Assistant uses visual signals to detect when a user is demonstrating an intent to engage with it and then “wakes up” to listen to their utterance. The second phase is designed to further validate and understand the user’s intent using visual and acoustic signals. Look and Talk considers all signals in the first and second processing phases to determine if the interaction is likely intended for Assistant. These two phases are the core Look and Talk functionality, and are discussed below. The third phase of query fulfillment is typical query flow, and is beyond the scope of this blog.
Phase One: Engaging with Assistant
The first phase of Look and Talk is designed to assess whether an enrolled user is intentionally engaging with Assistant. Look and Talk uses face detection to identify the user’s presence, filters for proximity using the detected face box size to infer distance, and then uses the existing Face Match system to determine whether they are enrolled Look and Talk users.
For an enrolled user within range, a custom eye gaze model determines whether they are looking at the device. This model estimates both the gaze angle and a binary gaze-on-camera confidence from image frames using a multi-tower convolutional neural network architecture, with one tower processing the whole face and another processing patches around the eyes. Since the device screen covers a region underneath the camera that would be natural for a user to look at, we map the gaze angle and binary gaze-on-camera prediction to the device screen area. To make the final prediction resilient to spurious individual predictions, involuntary eye blinks, and saccades, we apply a smoothing function to the individual frame-based predictions.
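The post does not specify the smoothing function; as one hypothetical way to realize it, consider an exponential moving average over per-frame gaze-on-camera confidences combined with hysteresis thresholds (all constants below are illustrative):

```python
def smooth_gaze(confidences, alpha=0.3, on_thresh=0.7, off_thresh=0.4):
    """EMA over per-frame gaze confidences with hysteresis, so that a single
    blink or saccade does not flip the attention state."""
    ema, attending, states = 0.0, False, []
    for c in confidences:
        ema = alpha * c + (1 - alpha) * ema
        if attending:
            attending = ema >= off_thresh  # must drop well below to turn off
        else:
            attending = ema >= on_thresh   # must rise well above to turn on
        states.append(attending)
    return states
```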
To minimize false triggers, e.g., when a passing user briefly glances at the device, we enforce stricter attention requirements before informing users that the system is ready for interaction. Once the user looking at the device starts speaking, we relax the attention requirement, allowing the user to naturally shift their gaze.
The final signal necessary in this processing phase checks that the Face Matched user is the active speaker. This is provided by a multimodal active speaker detection model that takes as input both video of the user’s face and the audio containing speech, and predicts whether they are speaking. A number of augmentation techniques (including RandAugment, SpecAugment, and augmenting with AudioSet sounds) help improve prediction quality for the in-home domain, boosting end-feature performance by over 10%. The final deployed model is a quantized, hardware-accelerated TFLite model, which uses five frames of context for the visual input and 0.5 seconds for the audio input.
Phase Two: Assistant Starts Listening
In phase two, the system starts listening to the content of the user’s query, still entirely on-device, to further assess whether the interaction is intended for Assistant using additional signals. First, Look and Talk uses Voice Match to further ensure that the speaker is enrolled and matches the earlier Face Match signal. Then, it runs a state-of-the-art automatic speech recognition model on-device to transcribe the utterance.
The next critical processing step is the intent understanding algorithm, which predicts whether the user’s utterance was intended to be an Assistant query. This has two parts: 1) a model that analyzes the non-lexical information in the audio (i.e., pitch, speed, hesitation sounds) to determine whether the utterance sounds like an Assistant query, and 2) a text analysis model that determines whether the transcript is an Assistant request. Together, these filter out queries not intended for Assistant. The algorithm also uses contextual visual signals to determine the likelihood that the interaction was intended for Assistant.
Finally, when the intent understanding model determines that the user utterance was likely meant for Assistant, Look and Talk moves into the fulfillment phase where it communicates with the Assistant server to obtain a response to the user’s intent and query text.
Performance, Personalization and UX
Each model that supports Look and Talk was evaluated and improved in isolation and then tested in the end-to-end Look and Talk system. The huge variety of ambient conditions in which Look and Talk operates necessitates the introduction of personalization parameters for algorithm robustness. By using signals obtained during the user’s hotword-based interactions, the system personalizes parameters to individual users to deliver improvements over the generalized global model. This personalization also runs entirely on-device.
Without a predefined hotword as a proxy for user intent, latency was a significant concern for Look and Talk. Often, a strong enough interaction signal does not occur until well after the user has started speaking, which can add hundreds of milliseconds of latency, and existing models for intent understanding add to this since they require complete, not partial, queries. To bridge this gap, Look and Talk completely forgoes streaming audio to the server, with transcription and intent understanding being on-device. The intent understanding models can work off of partial utterances. This results in an end-to-end latency comparable with current hotword-based systems.
The UI experience is based on user research to provide well-balanced visual feedback with high learnability. This is illustrated in the figure below.
We developed a diverse video dataset with over 3,000 participants to test the feature across demographic subgroups. Modeling improvements driven by diversity in our training data improved performance for all subgroups.
Conclusion
Look and Talk represents a significant step toward making user engagement with Google Assistant as natural as possible. While this is a key milestone in our journey, we hope this will be the first of many improvements to our interaction paradigms that will continue to reimagine the Google Assistant experience responsibly. Our goal is to make getting help feel natural and easy, ultimately saving time so users can focus on what matters most.
Acknowledgements
This work involved collaborative efforts from a multidisciplinary team of software engineers, researchers, UX, and cross-functional contributors. Key contributors from Google Assistant include Alexey Galata, Alice Chuang, Barbara Wang, Britanie Hall, Gabriel Leblanc, Gloria McGee, Hideaki Matsui, James Zanoni, Joanna (Qiong) Huang, Krunal Shah, Kavitha Kandappan, Pedro Silva, Tanya Sinha, Tuan Nguyen, Vishal Desai, Will Truong, Yixing Cai, Yunfan Ye; and from Research: Hao Wu, Joseph Roth, Sagar Savla, Sourish Chaudhuri, Susanna Ricco. Thanks to Yuan Yuan and Caroline Pantofaru for their leadership, and everyone on the Nest, Assistant, and Research teams who provided invaluable input toward the development of Look and Talk.
The increasing complexity of code poses a key challenge to productivity in software engineering. Code completion has been an essential tool that has helped mitigate this complexity in integrated development environments (IDEs). Conventionally, code completion suggestions are implemented with rule-based semantic engines (SEs), which typically have access to the full repository and understand its semantic structure. Recent research has demonstrated that large language models (e.g., Codex and PaLM) enable longer and more complex code suggestions, and as a result, useful products have emerged (e.g., Copilot). However, the question of how code completion powered by machine learning (ML) impacts developer productivity, beyond perceived productivity and accepted suggestions, remains open.
Today we describe how we combined ML and SEs to develop a novel Transformer-based hybrid semantic ML code completion, now available to internal Google developers. We discuss how ML and SEs can be combined by (1) re-ranking SE single token suggestions using ML, (2) applying single and multi-line completions using ML and checking for correctness with the SE, or (3) using single and multi-line continuation by ML of single token semantic suggestions. We compare the hybrid semantic ML code completion usage of 10k+ Googlers (over three months across eight programming languages) with a control group and see a 6% reduction in coding iteration time (time between builds and tests) and a 7% reduction in context switches (i.e., leaving the IDE) when developers are exposed to single-line ML completion. These results demonstrate that the combination of ML and SEs can improve developer productivity. Currently, 3% of new code (measured in characters) is generated from accepting ML completion suggestions.
Transformers for Completion
A common approach to code completion is to train transformer models, which use a self-attention mechanism for language understanding, to enable code understanding and completion predictions. We treat code similarly to language, represented with sub-word tokens and a SentencePiece vocabulary, and use encoder-decoder transformer models running on TPUs to make completion predictions. The input is the code surrounding the cursor (~1000-2000 tokens) and the output is a set of suggestions to complete the current or multiple lines. Sequences are generated with a beam search (or tree exploration) on the decoder.
During training on Google’s monorepo, we mask out the remainder of a line and some follow-up lines to mimic code that is being actively developed. We train a single model on eight languages (C++, Java, Python, Go, TypeScript, Proto, Kotlin, and Dart) and observe improved or equal performance across all languages, removing the need for dedicated models. Moreover, we find that a model size of ~0.5B parameters gives a good tradeoff of high prediction accuracy against low latency and resource cost. The model strongly benefits from the quality of the monorepo, which is enforced by guidelines and reviews. For multi-line suggestions, we iteratively apply the single-line model with learned thresholds for deciding whether to start predicting completions for the following line.
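A simplified sketch of how such a training example might be cut from a source file: pick a cursor position, take the code before it as the model input, and take the masked-out remainder of the line plus a few follow-up lines as the target. The sampling details and token budgets here are our assumptions, not the production pipeline.

```python
import random

def make_completion_example(lines, max_target_lines=3):
    """Split a file (list of code lines) into (context, target) at a cursor."""
    row = random.randrange(len(lines))
    col = random.randrange(len(lines[row]) + 1)   # cursor column, inclusive of EOL
    context = "\n".join(lines[:row] + [lines[row][:col]])
    n_follow = random.randint(0, max_target_lines - 1)
    target = "\n".join([lines[row][col:]] + lines[row + 1 : row + 1 + n_follow])
    return context, target
```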
Re-rank Single Token Suggestions with ML
While a user is typing in the IDE, code completions are interactively requested from the ML model and the SE simultaneously in the backend. The SE typically only predicts a single token. The ML models we use predict multiple tokens until the end of the line, but we only consider the first token to match predictions from the SE. We identify the top three ML suggestions that are also contained in the SE suggestions and boost their rank to the top. The re-ranked results are then shown as suggestions for the user in the IDE.
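A sketch of the boosting step as described: ML suggestions whose first token also appears among the SE's single-token suggestions are promoted to the top, capped at three. The whitespace tokenization and function shape are our simplifications.

```python
def rerank(se_tokens, ml_suggestions, max_boost=3):
    """Boost SE tokens that the ML model also predicts as a line's first token.

    se_tokens: ranked single-token suggestions from the semantic engine.
    ml_suggestions: ranked full-line ML predictions.
    """
    ml_first = [s.split()[0] for s in ml_suggestions if s.split()]
    boosted = list(dict.fromkeys(t for t in ml_first if t in se_tokens))[:max_boost]
    return boosted + [t for t in se_tokens if t not in boosted]
```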
In practice, our SEs are running in the cloud, providing language services (e.g., semantic completion and diagnostics) with which developers are familiar, so we co-located the SEs with the TPUs performing ML inference. The SEs are based on an internal library that offers compiler-like features with low latencies. Because requests are made in parallel and the ML model is typically faster to serve (~40 ms median), we do not add any latency to completions. We observe a significant quality improvement in real usage. For 28% of accepted completions, the rank of the completion is higher due to boosting, and in 0.4% of cases it is worse. Additionally, we find that users type >10% fewer characters before accepting a completion suggestion.
Check Single / Multi-line ML Completions for Semantic Correctness
At inference time, ML models are typically unaware of code outside of their input window, and code seen during training might miss recent additions needed for completions in actively changing repositories. This leads to a common drawback of ML-powered code completion whereby the model may suggest code that looks correct but doesn’t compile. Based on internal user experience research, this issue can erode user trust over time while reducing productivity gains.
We use SEs to perform fast semantic correctness checks within a given latency budget (<100ms for end-to-end completion) and use cached abstract syntax trees to enable a “full” structural understanding. Typical semantic checks include reference resolution (i.e., does this object exist), method invocation checks (e.g., confirming the method was called with a correct number of parameters), and assignability checks (to confirm the type is as expected).
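One hypothetical shape for this filtering step, with the checks represented as callables and a fail-open policy when the budget runs out (both the policy and the names below are our assumptions):

```python
import time

def passes_semantic_checks(suggestion, checks, budget_ms=100):
    """Run checks (reference resolution, invocation arity, assignability, ...)
    until one fails or the latency budget is exhausted."""
    deadline = time.monotonic() + budget_ms / 1000.0
    for check in checks:
        if time.monotonic() > deadline:
            return True   # budget exhausted: fail open rather than block the UI
        if not check(suggestion):
            return False  # drop suggestions the semantic engine rejects
    return True
```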
For example, for the coding language Go, ~8% of suggestions contain compilation errors before semantic checks. However, the application of semantic checks filtered out 80% of uncompilable suggestions. The acceptance rate for single-line completions improved by 1.9x over the first six weeks of incorporating the feature, presumably due to increased user trust. As a comparison, for languages where we did not add semantic checking, we only saw a 1.3x increase in acceptance.
Results
With 10k+ Google-internal developers using the completion setup in their IDE, we measured a user acceptance rate of 25-34%. We determined that the transformer-based hybrid semantic ML code completion completes >3% of code, while reducing the coding iteration time for Googlers by 6% (at a 90% confidence level). The size of the shift corresponds to typical effects observed for transformational features (e.g., a key framework), which usually affect only a subpopulation, whereas ML has the potential to generalize across most major languages and engineers.
Providing Long Completions while Exploring APIs
We also tightly integrated the semantic completion with full line completion. When the dropdown with semantic single token completions appears, we display inline the single-line completions returned from the ML model. The latter represent a continuation of the item that is the focus of the dropdown. For example, if a user looks at possible methods of an API, the inline full line completions show the full method invocation, including all of its parameters.
Conclusion and Future Work
We demonstrate how the combination of rule-based semantic engines and large language models can be used to significantly improve developer productivity with better code completion. As a next step, we want to utilize SEs further by providing extra information to ML models at inference time. One example is to have the ML model and the SE go back and forth on long predictions, where the SE iteratively checks correctness and offers all possible continuations to the ML model. When adding new features powered by ML, we want to go beyond merely “smart” results and ensure a positive impact on productivity.
Acknowledgements
This research is the outcome of a two-year collaboration between Google Core and Google Research, Brain Team. Special thanks to Marc Rasi, Yurun Shen, Vlad Pchelin, Charles Sutton, Varun Godbole, Jacob Austin, Danny Tarlow, Benjamin Lee, Satish Chandra, Ksenia Korovina, Stanislav Pyatykh, Cristopher Claeys, Petros Maniatis, Evgeny Gryaznov, Pavel Sychev, Chris Gorgolewski, Kristof Molnar, Alberto Elizondo, Ambar Murillo, Dominik Schulz, David Tattersall, Rishabh Singh, Manzil Zaheer, Ted Ying, Juanjo Carin, Alexander Froemmgen, Maxim Kachurovskiy, and Marcus Revaj for their contributions.
Current deep reinforcement learning (RL) methods can train specialist artificial agents that excel at decision-making on various individual tasks in specific environments, such as Go or StarCraft. However, little progress has been made to extend these results to generalist agents that would not only be capable of performing many different tasks, but of doing so across a variety of environments with potentially distinct embodiments.
Looking across recent progress in the fields of natural language processing, vision, and generative models (such as PaLM, Imagen, and Flamingo), we see that breakthroughs in making general-purpose models are often achieved by scaling up Transformer-based models and training them on large and semantically diverse datasets. It is natural to wonder: can a similar strategy be used in building generalist agents for sequential decision making? Can such models also enable fast adaptation to new tasks, similar to PaLM and Flamingo?
As an initial step to answer these questions, in our recent paper “Multi-Game Decision Transformers” we explore how to build a generalist agent to play many video games simultaneously. Our model trains an agent that can play 41 Atari games simultaneously at close-to-human performance and that can also be quickly adapted to new games via fine-tuning. This approach significantly improves upon the few existing alternatives to learning multi-game agents, such as temporal difference (TD) learning or behavioral cloning (BC).
Don’t Optimize for Return, Just Ask for Optimality
In reinforcement learning, reward refers to the incentive signals that are relevant to completing a task, and return refers to cumulative rewards over a course of interactions between an agent and its surrounding environment. Traditional deep reinforcement learning agents (DQN, SimPLe, Dreamer, etc.) are trained to optimize decisions to achieve the optimal return. At every time step, an agent observes the environment (some also consider the interactions that happened in the past) and decides what action to take to help itself achieve a higher return in future interactions.
In this work, we use Decision Transformers as our backbone approach to training an RL agent. A Decision Transformer is a sequence model that predicts future actions by considering past interactions between an agent and the surrounding environment, and (most importantly) a desired return to be achieved in future interactions. Instead of learning a policy to achieve high return magnitude as in traditional reinforcement learning, Decision Transformers map diverse experiences, ranging from expert-level to beginner-level, to their corresponding return magnitude during training. The idea is that training an agent on a range of experiences (from beginner to expert level) exposes the model to a wider range of variations in gameplay, which in turn helps it extract useful rules of gameplay that allow it to succeed under any circumstance. So during inference, the Decision Transformer can achieve any return value in the range it has seen during training, including the optimal return.
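Concretely, a Decision Transformer consumes trajectories flattened into an interleaved token stream of returns, observations, and actions, and is trained to predict the action tokens. The sketch below shows one common layout of that stream; the exact tokenization and ordering in the Multi-Game Decision Transformer differ in detail, so treat this as a schematic.

```python
def build_dt_sequence(trajectory):
    """Flatten steps into (return-to-go, observation, action) token triples.

    trajectory: list of (observation, action, reward) tuples.
    """
    rewards = [r for _, _, r in trajectory]
    tokens = []
    for t, (obs, action, _) in enumerate(trajectory):
        tokens.append(("R", sum(rewards[t:])))  # desired return from step t on
        tokens.append(("s", obs))
        tokens.append(("a", action))            # the model learns to predict these
    return tokens
```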
But how do you know if a return is both optimal and stably achievable in a given environment? Previous applications of Decision Transformers relied on customized definitions of the desired return for each individual task, which required manually defining a plausible and informative range of scalar values for each specific game, a process that is non-trivial and does not scale. To address this issue, we instead model a distribution of return magnitudes based on past interactions with the environment during training. At inference time, we simply add an optimality bias that increases the probability of generating actions that are associated with higher returns.
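One way to picture this bias, as a hedged sketch: tilt the learned return distribution by an exponential weight on high returns before sampling the target return. The parameter kappa and the discretized return bins below are our illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def sample_target_return(return_bins, log_probs, kappa=10.0, rng=None):
    """Sample a target return from the learned distribution, tilted toward
    higher returns by an optimality bias proportional to exp(kappa * R)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(log_probs) + kappa * np.asarray(return_bins, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(np.asarray(return_bins), p=probs)
```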
To more comprehensively capture spatial-temporal patterns of agent-environment interactions, we also modified the Decision Transformer architecture to consider image patches instead of a global image representation. Patches allow the model to focus on local dynamics, which helps capture game-specific information in further detail.
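For reference, turning an observation into patch tokens is a simple reshape; a minimal sketch (the patch size is illustrative):

```python
import numpy as np

def image_to_patches(frame, patch=16):
    """Split an (H, W, C) observation into non-overlapping patch tokens."""
    H, W, C = frame.shape
    assert H % patch == 0 and W % patch == 0
    gh, gw = H // patch, W // patch
    x = frame.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * C)  # one flat vector per patch
```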
These pieces together give us the backbone of Multi-Game Decision Transformers:
Training a Multi-Game Decision Transformer to Play 41 Games at Once
We train one Decision Transformer agent on a large (~1B) and broad set of gameplay experiences from 41 Atari games. In our experiments, this agent, which we call the Multi-Game Decision Transformer (MGDT), clearly outperforms existing reinforcement learning and behavioral cloning methods — by almost 2 times — on learning to play 41 games simultaneously and performs near human-level competency (100% in the following figure corresponds to the level of human gameplay). These results hold when comparing across training methods in both settings where a policy must be learned from static datasets (offline) as well as those where new data can be gathered from interacting with the environment (online).
This result indicates that Decision Transformers are well-suited for multi-task, multi-environment, and multi-embodiment agents.
A concurrent work, “A Generalist Agent”, shows a similar result, demonstrating that large transformer-based sequence models can memorize expert behaviors very well across many more environments. In addition, their work and our work have nicely complementary findings: They show it’s possible to train across a wide range of environments beyond Atari games, while we show it’s possible and useful to train across a wide range of experiences.
In addition to the performance shown above, empirically we found that MGDT trained on a wide variety of experience is better than MGDT trained only on expert-level demonstrations or simply cloning demonstration behaviors.
Scaling Up Multi-Game Model Size to Achieve Better Performance
Arguably, scale has become the main driving force in many recent machine learning breakthroughs, and it is usually achieved by increasing the number of parameters in a transformer-based model. Our observation on Multi-Game Decision Transformers is similar: performance increases predictably with larger model size. In particular, performance appears to have not yet hit a ceiling, and, compared to other learning systems, the gains from increasing model size are more significant.
Pre-trained Multi-Game Decision Transformers Are Fast Learners
Another benefit of MGDTs is that they can learn how to play a new game from very few gameplay demonstrations (which don’t need to all be expert-level). In that sense, MGDTs can be considered pre-trained models capable of being fine-tuned rapidly on small amounts of new gameplay data. Compared with other popular pre-training methods, MGDT shows consistent advantages in obtaining higher scores.
Where Is the Agent Looking?
In addition to the quantitative evaluation, it’s insightful (and fun) to visualize the agent’s behavior. By probing the attention heads, we find that the MGDT model consistently places attention weight on areas of the observed images that contain meaningful game entities. We visualize the model’s attention when predicting the next action for various games and find it consistently attends to entities such as the agent’s on-screen avatar, the agent’s free movement space, non-agent objects, and key environment features. For example, in an interactive setting, having an accurate world model requires knowing how and when to focus on known objects (e.g., currently present obstacles) as well as expecting and/or planning over future unknowns (e.g., negative space). This diverse allocation of attention to many key components of each environment ultimately improves performance.
The Future of Large-Scale Generalist Agents
This work is an important step in demonstrating the possibility of training general-purpose agents across many environments, embodiments, and behavior styles. We have shown the benefit of increased scale on performance and the potential for further scaling. These findings seem to point to a generalization narrative similar to other domains like vision and language, hinting at the great potential of scaling data and the effectiveness of learning from diverse experiences.
We look forward to future research towards developing performant agents for multi-environment and multi-embodiment settings. Our code and model checkpoints can be accessed here.
Acknowledgements
We’d like to thank all remaining authors of the paper, including Igor Mordatch, Ofir Nachum, Mengjiao Yang, Lisa Lee, Daniel Freeman, Sergio Guadarrama, Ian Fischer, Eric Jang, and Henryk Michalewski.
Every year, nearly a billion chest X-ray (CXR) images are taken globally to aid in the detection and management of health conditions ranging from collapsed lungs to infectious diseases. Generally, CXRs are cheaper and more accessible than other forms of medical imaging. However, existing challenges continue to impede the optimal use of CXRs. For example, in some areas, trained radiologists who can accurately interpret CXR images are in short supply. In addition, interpretation variability between experts, workflow differences between institutions, and the presence of rare conditions familiar only to subspecialists all contribute to making high-quality CXR interpretation a challenge.
Recent research has leveraged machine learning (ML) to explore potential solutions for some of these challenges. There is significant interest and effort devoted to building deep learning models that detect abnormalities in CXRs and improve the access, accuracy, and efficiency of identifying diseases and conditions that affect the heart and lungs. However, building robust CXR models requires large labeled training datasets, which can be prohibitively expensive and time-consuming to create. In some cases, such as working with underrepresented populations or studying rare medical conditions, only limited data are available. Additionally, CXR images vary in quality across populations, geographies, and institutions, making it difficult to build robust models that perform well globally.
In “Simplified Transfer Learning for Chest Radiography Models Using Less Data”, published in the journal Radiology, we describe how Google Health utilizes advanced ML methods to generate pre-trained “CXR networks” that can convert CXR images to embeddings (i.e., information-rich numerical vectors) to enable the development of CXR models using less data and fewer computational resources. We demonstrate that even with less data and compute, this approach has enabled performance comparable to state-of-the-art deep learning models across various prediction tasks. We are also excited to announce the release of CXR Foundation, a tool that utilizes our CXR-specific network to enable developers to create custom embeddings for their CXR images. We believe this work will help accelerate the development of CXR models, aiding in disease detection and contributing to more equitable health access throughout the world.
Developing a Chest X-ray Network
A common approach to building medical ML models is to pre-train a model on a generic task using non-medical datasets and then refine the model on a target medical task. This process of transfer learning may improve the target task performance or at least speed up convergence by applying the understanding of natural images to medical images. However, transfer learning may still require large labeled medical datasets for the refinement step.
Expanding on this standard approach, our system supports modeling CXR-specific tasks through a three-step model training setup composed of (1) generic image pre-training similar to traditional transfer learning, (2) CXR-specific pre-training, and (3) task-specific training. The first and third steps are common in ML: first pre-training on a large dataset and labels that are not specific to the desired task, and then fine-tuning on the task of interest.
We built a CXR-specific image classifier that employs supervised contrastive learning (SupCon). SupCon pulls together representations of images that have the same label (e.g., abnormal) and pushes apart representations of images that have a different label (e.g., one normal image and one abnormal image). We pre-trained this model on de-identified CXR datasets of over 800,000 images generated in partnership with Northwestern Medicine and Apollo Hospitals in the US and India, respectively. We then leveraged noisy abnormality labels from natural language processing of radiology reports to build our “CXR-specific” network.
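For reference, the SupCon objective (Khosla et al.) can be written compactly; the numpy sketch below is our simplified, numerically naive version and assumes L2-normalized embeddings.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss: pull together embeddings sharing a label,
    push apart those with different labels. embeddings: (N, D), L2-normalized."""
    z = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    sim = z @ z.T / temperature                      # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                   # exclude self-contrast
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & ~np.eye(len(z), dtype=bool)
    pos_sum = np.where(positives, log_prob, 0.0).sum(axis=1)
    return -(pos_sum / np.maximum(positives.sum(axis=1), 1)).mean()
```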
This network creates embeddings (i.e., information-rich numerical vectors that can be used to distinguish classes from each other) that can more easily train models for specific medical prediction tasks, such as image finding (e.g., airspace opacity), clinical condition (e.g., tuberculosis), or patient outcome (e.g., hospitalization). For example, the CXR network can generate embeddings for every image in a given CXR dataset. For these images, the generated embeddings and the labels for the desired target task (such as tuberculosis) are used as examples to train a small ML model.
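In practice, the downstream step can be as small as a logistic regression over the frozen embeddings. A hedged sketch using scikit-learn as a stand-in for the small ML model (variable names and the dataset are placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_task_head(train_emb, train_labels, val_emb, val_labels):
    """Fit a small classifier on frozen CXR embeddings for a target task
    (e.g., tuberculosis yes/no) and report validation AUC."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_labels)
    val_scores = clf.predict_proba(val_emb)[:, 1]
    return clf, roc_auc_score(val_labels, val_scores)
```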
Effects of CXR Pre-training
We visualized these embedding layers at each step of the process using airspace opacity as an example (see the figure below). Before SupCon-based pre-training, there was poor separation between normal and abnormal CXR embeddings. After SupCon-based pre-training, the positive examples were grouped more closely together, and the negative examples likewise, indicating that the model had learned that images within each category resemble one another.
Our research suggests that adding the second stage of pre-training enables high-quality models to be trained with up to 600-fold less data in comparison to traditional transfer learning approaches that leverage pre-trained models on generic, non-medical datasets. We found this to be true regardless of model architecture (e.g., ResNet or EfficientNet) or dataset used for natural image pre-training (e.g., ImageNet or JFT-300M). With this approach, researchers and developers can significantly reduce dataset size requirements.
Results
After training the initial model, we measured performance using the area under the curve (AUC) metric with both linear and non-linear models applied to CXR embeddings, as well as with a non-linear model produced by fine-tuning the entire network. On public datasets, such as ChestX-ray14 and CheXpert, our work substantially and consistently improved the data-accuracy tradeoff for models developed across a range of training dataset sizes and several findings. The data efficiency gains were most striking when evaluating the tool’s ability to develop tuberculosis models: models trained on the embeddings of just 45 images achieved non-inferiority to radiologists in detecting tuberculosis on an external validation dataset. For both tuberculosis and severe COVID-19 outcomes, we show that non-linear classifiers trained on frozen embeddings outperformed a model that was fine-tuned on the entire dataset.
Conclusion and Future Work
To accelerate CXR modeling efforts with low data and computational requirements, we are releasing our CXR Foundation tool, along with scripts to train linear and nonlinear classifiers. Via these embeddings, this tool will allow researchers to jump-start CXR modeling efforts using simpler transfer learning methods. This approach can be particularly useful for predictive modeling using small datasets, and for adapting CXR models when there are distribution shifts in patient populations (whether over time or across different institutions). We are excited to continue working with partners, such as Northwestern Medicine and Apollo Hospitals, to explore the impact of this technology further. By enabling researchers with limited data and compute to develop CXR models, we're hoping more developers can solve the most impactful problems for their populations.
Acknowledgements
Key contributors to this project at Google include Christina Chen, Yun Liu, Dilip Krishnan, Zaid Nabulsi, Atilla Kiraly, Arnav Agharwal, Eric Wu, Yuanzhen Li, Aaron Maschinot, Aaron Sarna, Jenny Huang, Marilyn Zhang, Charles Lau, Neeral Beladia, Daniel Tse, Krish Eswaran, and Shravya Shetty. Significant contributions and input were also made by collaborators Sreenivasa Raju Kalidindi, Mozziyar Etemadi, Florencia Garcia-Vicente, and David Melnick. For the ChestX-ray14 dataset, we thank the NIH Clinical Center for making it publicly available. The authors would also like to acknowledge many members of the Google Health Radiology and labeling software teams. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study; Jonny Wong for coordinating the imaging annotation work; Craig Mermel and Akinori Mitani for providing feedback on the manuscript; Nicole Linton and Lauren Winer for feedback on the blogpost; and Tom Small for the animation.
Google is a leader in machine learning (ML) research with groups innovating across virtually all aspects of the field, from theory to application. We build machine learning systems to solve deep scientific and engineering challenges in areas of language, music, visual processing, algorithm development, and more. Core to our approach is to actively engage with the broader research community by open-sourcing datasets and models, publishing our discoveries, and actively participating in leading conferences.
Google is proud to be a Diamond Sponsor of the thirty-ninth International Conference on Machine Learning (ICML 2022), a premier annual conference, which is being held this week in Baltimore, Maryland. Google has a strong presence at this year’s conference with over 100 accepted publications and active involvement in a number of workshops and tutorials. We look forward to sharing some of our extensive ML research and expanding our partnership with the broader ML research community.
Registered for ICML 2022? We hope you’ll visit the Google booth to learn more about the exciting work, creativity, and fun that goes into solving a portion of the field’s most interesting challenges. Take a look below to learn more about the Google research being presented at ICML 2022 (Google affiliations in bold).
Organizing Committee
Tutorial Chairs include: Hanie Sedghi
Emeritus Members include: Andrew McCallum
Board Members include: Hugo Larochelle, Corinna Cortes
Publications
Individual Preference Stability for Clustering Saba Ahmadi, Pranjal Awasthi, Samir Khuller, Matthäus Kleindessner, Jamie Morgenstern, Pattara Sukprasert, Ali Vakilian
Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning Utku Evci, Vincent Dumoulin, Hugo Larochelle, Michael Mozer
H-Consistency Bounds for Surrogate Loss Minimizers Pranjal Awasthi, Anqi Mao, Mehryar Mohri, Yutao Zhong
Cooperative Online Learning in Stochastic and Adversarial MDPs Tal Lancewicki, Aviv Rosenberg, Yishay Mansour
Do More Negative Samples Necessarily Hurt in Contrastive Learning? Pranjal Awasthi, Nishanth Dikkala, Pritish Kamath
Deletion Robust Submodular Maximization Over Matroids Paul Dütting, Federico Fusco*, Silvio Lattanzi, Ashkan Norouzi-Fard, Morteza Zadimoghaddam
Tight and Robust Private Mean Estimation with Few Users Hossein Esfandiari, Vahab Mirrokni, Shyam Narayanan*
Generative Trees: Adversarial and Copycat Richard Nock, Mathieu Guillame-Bert
Agnostic Learnability of Halfspaces via Logistic Loss Ziwei Ji*, Kwangjun Ahn*, Pranjal Awasthi, Satyen Kale, Stefani Karp
Adversarially Trained Actor Critic for Offline Reinforcement Learning Ching-An Cheng, Tengyang Xie, Nan Jiang, Alekh Agarwal
Unified Scaling Laws for Routed Language Models Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew Johnson, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon Osindero, Oriol Vinyals, Marc'Aurelio Ranzato, Jack Rae, Erich Elsen, Koray Kavukcuoglu, Karen Simonyan
Large Batch Experience Replay Thibault Lahire, Matthieu Geist, Emmanuel Rachelson
Robust Training of Neural Networks Using Scale Invariant Architectures Zhiyuan Li*, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Sanjiv Kumar
The Poisson Binomial Mechanism for Unbiased Federated Learning with Secure Aggregation Wei-Ning Chen, Ayfer Ozgur, Peter Kairouz
Global Optimization Networks Sen Zhao, Erez Louidor, Maya Gupta
A Joint Exponential Mechanism for Differentially Private Top-k Jennifer Gillenwater, Matthew Joseph, Andres Munoz Medina, Mónica Ribero
On the Practicality of Deterministic Epistemic Uncertainty Janis Postels, Mattia Segu, Tao Sun, Luc Van Gool, Fisher Yu, Federico Tombari
Balancing Discriminability and Transferability for Source-Free Domain Adaptation Jogendra Nath Kundu, Akshay Kulkarni, Suvaansh Bhambri, Deepesh Mehta, Shreyas Kulkarni, Varun Jampani, Venkatesh Babu Radhakrishnan
Transfer and Marginalize: Explaining Away Label Noise with Privileged Information Mark Collier, Rodolphe Jenatton, Efi Kokiopoulou, Jesse Berent
In Defense of Dual-Encoders for Neural Ranking Aditya Menon, Sadeep Jayasumana, Ankit Singh Rawat, Seungyeon Kim, Sashank Jakkam Reddi, Sanjiv Kumar
Surrogate Likelihoods for Variational Annealed Importance Sampling Martin Jankowiak, Du Phan
Translatotron 2: High-Quality Direct Speech-to-Speech Translation with Voice Preservation (see blog post) Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz
Differentially Private Approximate Quantiles Haim Kaplan, Shachar Schnapp, Uri Stemmer
Continuous Control with Action Quantization from Demonstrations Robert Dadashi, Léonard Hussenot, Damien Vincent, Sertan Girgin, Anton Raichuk, Matthieu Geist, Olivier Pietquin
Data Scaling Laws in NMT: The Effect of Noise and Architecture Yamini Bansal*, Behrooz Ghorbani, Ankush Garg, Biao Zhang, Maxim Krikun, Colin Cherry, Behnam Neyshabur, Orhan Firat
Debiaser Beware: Pitfalls of Centering Regularized Transport Maps Aram-Alexandre Pooladian, Marco Cuturi, Jonathan Niles-Weed
A Context-Integrated Transformer-Based Neural Network for Auction Design Zhijian Duan, Jingwu Tang, Yutong Yin, Zhe Feng, Xiang Yan, Manzil Zaheer, Xiaotie Deng
Algorithms for the Communication of Samples Lucas Theis, Noureldin Yosri
Being Properly Improper Tyler Sypherd, Richard Nock, Lalitha Sankar
Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation Chris Dann, Yishay Mansour, Mehryar Mohri, Ayush Sekhari, Karthik Sridharan
Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error Scott Fujimoto, David Meger, Doina Precup, Ofir Nachum, Shixiang Shane Gu
Public Data-Assisted Mirror Descent for Private Model Training Ehsan Amid, Arun Ganesh*, Rajiv Mathews, Swaroop Ramaswamy, Shuang Song, Thomas Steinke, Vinith M. Suriyakumar*, Om Thakkar, Abhradeep Thakurta
Deep Hierarchy in Bandits Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, Mohammad Ghavamzadeh
Scalable Deep Reinforcement Learning Algorithms for Mean Field Games Mathieu Lauriere, Sarah Perrin, Sertan Girgin, Paul Muller, Ayush Jain, Theophile Cabannes, Georgios Piliouras, Julien Perolat, Romuald Elie, Olivier Pietquin, Matthieu Geist
Faster Privacy Accounting via Evolving Discretization Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi
HyperPrompt: Prompt-Based Task-Conditioning of Transformers Yun He*, Huaixiu Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, YaGuang Li, Zhao Chen, Donald Metzler, Heng-Tze Cheng, Ed H. Chi
Blocks Assemble! Learning to Assemble with Large-Scale Structured Reinforcement Learning Seyed Kamyar Seyed Ghasemipour, Daniel Freeman, Byron David, Shixiang Shane Gu, Satoshi Kataoka, Igor Mordatch
Latent Diffusion Energy-Based Model for Interpretable Text Modelling Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, Ying Nian Wu
On the Optimization Landscape of Neural Collapse Under MSE Loss: Global Optimality with Unconstrained Features Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, Zhihui Zhu
Efficient Reinforcement Learning in Block MDPs: A Model-Free Representation Learning Approach Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, Wen Sun
Robust Training Under Label Noise by Over-Parameterization Sheng Liu, Zhihui Zhu, Qing Qu, Chong You
FriendlyCore: Practical Differentially Private Aggregation Eliad Tsfadia, Edith Cohen, Haim Kaplan, Yishay Mansour, Uri Stemmer
Adaptive Data Analysis with Correlated Observations Aryeh Kontorovich, Menachem Sadigurschi, Uri Stemmer
A Resilient Distributed Boosting Algorithm Yuval Filmus, Idan Mehalel, Shay Moran
On Learning Mixture of Linear Regressions in the Non-Realizable Setting Avishek Ghosh, Arya Mazumdar, Soumyabrata Pal, Rajat Sen
Online and Consistent Correlation Clustering Vincent Cohen-Addad, Silvio Lattanzi, Andreas Maggiori, Nikos Parotsidis
From Block-Toeplitz Matrices to Differential Equations on Graphs: Towards a General Theory for Scalable Masked Transformers Krzysztof Choromanski, Han Lin, Haoxian Chen, Tianyi Zhang, Arijit Sehanobish, Valerii Likhosherstov, Jack Parker-Holder, Tamas Sarlos, Adrian Weller, Thomas Weingarten
Parsimonious Learning-Augmented Caching Sungjin Im, Ravi Kumar, Aditya Petety, Manish Purohit
General-Purpose, Long-Context Autoregressive Modeling with Perceiver AR Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, Joao Carreira, Jesse Engel
Conformal Prediction Sets with Limited False Positives Adam Fisch, Tal Schuster, Tommi Jaakkola, Regina Barzilay
Dialog Inpainting: Turning Documents into Dialogs Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Zhao, Aida Amini, Qazi Mamunur Rashid, Mike Green, Kelvin Guu
Benefits of Overparameterized Convolutional Residual Networks: Function Approximation Under Smoothness Constraint Hao Liu, Minshuo Chen, Siawpeng Er, Wenjing Liao, Tong Zhang, Tuo Zhao
Congested Bandits: Optimal Routing via Short-Term Resets Pranjal Awasthi, Kush Bhatia, Sreenivas Gollapudi, Kostas Kollias
Provable Stochastic Optimization for Global Contrastive Learning: Small Batch Does Not Harm Performance Zhuoning Yuan, Yuexin Wu, Zihao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, Tianbao Yang
Examining Scaling and Transfer of Language Model Architectures for Machine Translation Biao Zhang*, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (see blog post) Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathy Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui
How to Leverage Unlabeled Data in Offline Reinforcement Learning? Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Chelsea Finn, Sergey Levine
Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning Harley Wiltzer, David Meger, Marc G. Bellemare
On the Robustness of CountSketch to Adaptive Inputs Edith Cohen, Xin Lyu, Jelani Nelson, Tamás Sarlós, Moshe Shechner, Uri Stemmer
Model Selection in Batch Policy Optimization Jonathan N. Lee, George Tucker, Ofir Nachum, Bo Dai
The Fundamental Price of Secure Aggregation in Differentially Private Federated Learning Wei-Ning Chen, Christopher A. Choquette-Choo, Peter Kairouz, Ananda Theertha Suresh
Linear-Time Gromov Wasserstein Distances Using Low Rank Couplings and Costs Meyer Scetbon, Gabriel Peyré, Marco Cuturi*
Active Sampling for Min-Max Fairness Jacob Abernethy, Pranjal Awasthi, Matthäus Kleindessner, Jamie Morgenstern, Chris Russell, Jie Zhang
Making Linear MDPs Practical via Contrastive Representation Learning Tianjun Zhang, Tongzheng Ren, Mengjiao Yang, Joseph E. Gonzalez, Dale Schuurmans, Bo Dai
Achieving Minimax Rates in Pool-Based Batch Active Learning Claudio Gentile, Zhilei Wang, Tong Zhang
Private Adaptive Optimization with Side Information Tian Li, Manzil Zaheer, Sashank J. Reddi, Virginia Smith
Self-Supervised Learning With Random-Projection Quantizer for Speech Recognition Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, Yonghui Wu
Wide Bayesian Neural Networks Have a Simple Weight Posterior: Theory and Accelerated Sampling Jiri Hron, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein
The State of Sparse Training in Deep Reinforcement Learning Laura Graesser, Utku Evci, Erich Elsen, Pablo Samuel Castro
Constrained Discrete Black-Box Optimization Using Mixed-Integer Programming Theodore P. Papalexopoulos, Christian Tjandraatmadja, Ross Anderson, Juan Pablo Vielma, David Belanger
Massively Parallel k-Means Clustering for Perturbation Resilient Instances Vincent Cohen-Addad, Vahab Mirrokni, Peilin Zhong
What Language Model Architecture and Pre-training Objective Works Best for Zero-Shot Generalization? Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, Colin Raffel
Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, Ludwig Schmidt
Synergy and Symmetry in Deep Learning: Interactions Between the Data, Model, and Inference Algorithm Lechao Xiao, Jeffrey Pennington
Fast Finite Width Neural Tangent Kernel Roman Novak, Jascha Sohl-Dickstein, Samuel S. Schoenholz
The Combinatorial Brain Surgeon: Pruning Weights that Cancel One Another in Neural Networks Xin Yu, Thiago Serra, Srikumar Ramalingam, Shandian Zhe
Bayesian Imitation Learning for End-to-End Mobile Manipulation Yuqing Du, Daniel Ho, Alexander A. Alemi, Eric Jang, Mohi Khansari
HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning Andrey Zhmoginov, Mark Sandler, Max Vladymyrov
Marginal Distribution Adaptation for Discrete Sets via Module-Oriented Divergence Minimization Hanjun Dai, Mengjiao Yang, Yuan Xue, Dale Schuurmans, Bo Dai
Correlated Quantization for Distributed Mean Estimation and Optimization Ananda Theertha Suresh, Ziteng Sun, Jae Hun Ro, Felix Yu
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents Wenlong Huang, Pieter Abbeel, Deepak Pathak, Igor Mordatch
Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime Leonardo Cunha, Gauthier Gidel, Fabian Pedregosa, Damien Scieur, Courtney Paquette
Learning Iterative Reasoning through Energy Minimization Yilun Du, Shuang Li, Josh Tenenbaum, Igor Mordatch
Interactive Correlation Clustering with Existential Cluster Constraints Rico Angell, Nicholas Monath, Nishant Yadav, Andrew McCallum
Building Robust Ensembles via Margin Boosting Dinghuai Zhang, Hongyang Zhang, Aaron Courville, Yoshua Bengio, Pradeep Ravikumar, Arun Sai Suggala
Probabilistic Bilevel Coreset Selection Xiao Zhou, Renjie Pi, Weizhong Zhang, Yong Lin, Tong Zhang
Model Agnostic Sample Reweighting for Out-of-Distribution Learning Xiao Zhou, Yong Lin, Renjie Pi, Weizhong Zhang, Renzhe Xu, Peng Cui, Tong Zhang
Sparse Invariant Risk Minimization Xiao Zhou, Yong Lin, Weizhong Zhang, Tong Zhang
RUMs from Head-to-Head Contests Matteo Almanza, Flavio Chierichetti, Ravi Kumar, Alessandro Panconesi, Andrew Tomkins
A Parametric Class of Approximate Gradient Updates for Policy Optimization Ramki Gummadi, Saurabh Kumar, Junfeng Wen, Dale Schuurmans
On Implicit Bias in Overparameterized Bilevel Optimization Paul Vicol, Jonathan Lorraine, Fabian Pedregosa, David Duvenaud, Roger Grosse
Feature and Parameter Selection in Stochastic Linear Bandits Ahmadreza Moradipari, Berkay Turan, Yasin Abbasi-Yadkori, Mahnoosh Alizadeh, Mohammad Ghavamzadeh
Neural Network Poisson Models for Behavioural and Neural Spike Train Data Moein Khajehnejad, Forough Habibollahi, Richard Nock, Ehsan Arabzadeh, Peter Dayan and Amir Dezfouli
Deep Equilibrium Networks are Sensitive to Initialization Statistics Atish Agarwala, Samuel Schoenholz
A Regret Minimization Approach to Multi-Agent Control Udaya Ghai, Udari Madhushani, Naomi Leonard, Elad Hazan
Transformer Quality in Linear Time Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le
Workshops
Shift Happens: Crowdsourcing Metrics and Test Datasets Beyond ImageNet Organizing Committee includes: Roland S. Zimmerman Invited Speakers include: Chelsea Finn, Lucas Beyer
Machine Learning for Audio Synthesis Organizing Committee includes: Yu Zhang Invited Speakers include: Chris Donahue
New Frontiers in Adversarial Machine Learning Organizing Committee includes: Sanmi Koyejo
Spurious Correlations, Invariance, and Stability (SIC) Organizing Committee includes: Victor Veitch
DataPerf: Benchmarking Data for Data-Centric AI Organizing Committee includes: Lora Aroyo, Peter Mattson, Praveen Paritosh DataPerf Speakers include: Lora Aroyo, Peter Mattson, Praveen Paritosh Invited Speakers include: Jordi Pont-Tuset
Machine Learning for Astrophysics Invited Speakers include: Dustin Tran
Dynamic Neural Networks Organizing Committee includes: Carlos Riquelme Panel Chairs include: Neil Houlsby
Interpretable Machine Learning in Healthcare (IMLH) Organizing Committee includes: Ramin Zabih Invited Speakers include: Been Kim
Human-Machine Collaboration and Teaming Invited Speakers include: Fernanda Viégas, Martin Wattenberg, Yuhuai (Tony) Wu
Pre-training: Perspectives, Pitfalls, and Paths Forward Organizing Committee includes: Hugo Larochelle, Chelsea Finn Invited Speakers include: Hanie Sedghi, Charles Sutton
Responsible Decision Making in Dynamic Environments Invited Speakers include: Craig Boutilier
Principles of Distribution Shift (PODS) Organizing Committee includes: Hossein Mobahi
Hardware-Aware Efficient Training (HAET) Invited Speakers include: Tien-Ju Yang
Updatable Machine Learning Invited Speakers include: Chelsea Finn, Nicolas Papernot Organizing Committee includes: Ananda Theertha Suresh, Badih Ghazi, Chiyuan Zhang, Kate Donahue, Peter Kairouz, Ziteng Sun
Knowledge Retrieval and Language Models Invited Speakers include: Fernando Diaz, Quoc Le, Kenton Lee, Ellie Pavlick Organizing Committee includes: Urvashi Khandelwal, Chiyuan Zhang
Theory and Practice of Differential Privacy Organizing Committee includes: Badih Ghazi, Matthew Joseph, Peter Kairouz, Om Thakkar, Thomas Steinke, Ziteng Sun
Beyond Bayes: Paths Towards Universal Reasoning Systems Invited Speakers include: Charles Sutton Spotlight Talk: Language Model Cascades | David Dohan, Winnie Xu, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A. Saurous, Jascha Sohl-dickstein, Kevin Murphy, Charles Sutton
Safe Learning for Autonomous Driving (SL4AD) Invited Speakers include: Chelsea Finn