
An Optimistic Perspective on Offline Reinforcement Learning

“The potential for off-policy learning remains tantalizing, the best way to achieve it still a mystery.”
Sutton & Barto
Most reinforcement learning (RL) algorithms assume that an agent actively interacts with an online environment to learn from its own collected experience. These algorithms are challenging to apply to complex real-world problems (such as robotics and autonomous driving): extensive data collection in the real world can be extremely sample-inefficient and lead to unintended behavior, while algorithms operating in simulation require high-fidelity simulators that are challenging to build. However, for many real-world RL applications, there already exists a large amount of previously collected interaction data, which can be used to make RL feasible for those problems and to enable better generalization by incorporating diverse prior experiences.

Offline RL makes effective use of existing interaction data. It is the fully off-policy RL setting in which an agent is trained from a fixed dataset of logged experiences, without any further interactions with the environment. Offline RL can help (1) pretrain an RL agent using existing data, (2) empirically evaluate RL algorithms based on their ability to utilize a fixed dataset of interactions, and (3) deliver real-world impact. However, offline RL is considered challenging due to the distribution mismatch between online interactions and any fixed dataset of logged interactions: when the learned agent takes an action different from the one taken by the data collection agent, we don’t know the reward that should be provided.
RL with online interactions vs. Offline RL.
In “An Optimistic Perspective on Offline RL”, we propose a simple experimental setup for offline RL on Atari 2600 games, based on logged experiences of a DQN agent. We demonstrate that it is possible to train agents with high returns that outperform the data collection agents using standard off-policy RL algorithms, without explicitly correcting for any distribution mismatch. We also develop a robust RL algorithm, called random ensemble mixture (REM), which shows promising results on offline RL. Overall, we present an optimistic perspective that robust RL algorithms trained on sufficiently large and diverse offline datasets can lead to high quality behaviour, strengthening the emerging data-driven RL paradigm. To facilitate the development and evaluation of offline RL methods, we are also publicly releasing the DQN Replay Dataset and have open-sourced our code. More details can be found at

A Primer on Off-policy and Offline RL
We summarize various approaches to RL below:
Online, off-policy RL agents, such as DQN, achieve human-level performance on Atari 2600 games by just observing the game screen, without any explicit knowledge about the game. DQN estimates the effectiveness of an action at a given state of the environment in terms of maximum achievable future rewards (i.e., Q-values). Furthermore, recent distributional RL agents, such as QR-DQN, model the entire distribution of probable future rewards, rather than a single expected value for each state-action pair. Agents such as DQN and QR-DQN are considered “online” because they alternate between optimizing a policy (how an agent acts at a given state) and using that policy to collect more data.

In principle, off-policy RL agents can learn from data collected by any policy, not just the policy being optimized. However, in the offline RL setting, recent work presents a discouraging view that standard off-policy agents diverge or otherwise yield poor performance. To fix this, previous work proposes remedies by regularizing the learned policy to stay close to the dataset of offline interactions.

The DQN Replay Dataset for Offline RL
In this work, we revisit offline RL by first creating the DQN Replay Dataset. This dataset is generated using DQN agents trained on 60 Atari 2600 games for 200 million frames each, while using sticky actions (with 25% probability that the agent’s previous action is executed instead of the current action) to make the problem more challenging. For each of the 60 games, we train 5 DQN agents with different random initializations, and store all of the (state, action, reward, next state) tuples encountered during training into 5 replay datasets per game, resulting in a total of 300 datasets.
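To make the sticky-actions mechanism concrete, here is a minimal sketch of the rule described above. The class name `StickyActionSelector` and its interface are hypothetical and written only for illustration; the emulator applies this logic internally, so this is not the code used to generate the dataset.

```python
import random


class StickyActionSelector:
    """Hypothetical helper: repeats the previously executed action
    with probability `repeat_prob`, otherwise executes the chosen action."""

    def __init__(self, repeat_prob=0.25, seed=None):
        self.repeat_prob = repeat_prob
        self.rng = random.Random(seed)
        self.previous_action = None

    def select(self, chosen_action):
        # On the first step there is no previous action to repeat.
        if self.previous_action is not None and self.rng.random() < self.repeat_prob:
            executed = self.previous_action
        else:
            executed = chosen_action
        self.previous_action = executed
        return executed
```

Because the executed action can differ from the one the agent chose, the environment becomes stochastic from the agent's point of view, which is what makes the problem more challenging.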
Offline RL on Atari games using the DQN Replay Dataset.
The DQN Replay Dataset can then be used for training offline RL agents, without any interaction with the environment during training. Each game replay dataset is approximately 3.5 times larger than ImageNet and includes samples from all of the intermediate policies seen during the optimization of online DQN.
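The offline setting can be illustrated with a toy, tabular stand-in for the deep agents discussed here. The sketch below runs Q-learning purely on a fixed list of logged transitions and never calls an environment. The function name, the `(state, action, reward, next_state, done)` tuple layout, and the hyperparameters are assumptions chosen for illustration, not the setup used in our experiments.

```python
import random


def offline_q_learning(dataset, num_actions, steps=5000, lr=0.1, gamma=0.99, seed=0):
    """Tabular Q-learning on a fixed list of logged transitions.

    Each transition is (state, action, reward, next_state, done).
    No environment interaction happens anywhere in this function.
    """
    rng = random.Random(seed)
    q = {}  # maps state -> list of per-action value estimates

    def values(state):
        return q.setdefault(state, [0.0] * num_actions)

    for _ in range(steps):
        # Sample one logged transition uniformly from the fixed dataset.
        s, a, r, s_next, done = rng.choice(dataset)
        # Bootstrapped target: reward plus discounted best next-state value.
        target = r if done else r + gamma * max(values(s_next))
        values(s)[a] += lr * (target - values(s)[a])
    return q
```

The update is the same one an online agent would apply to its replay buffer; the only difference is that the buffer never grows.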

Training Offline Agents on the DQN Replay Dataset
We trained offline variants of DQN and distributional QR-DQN on the DQN Replay Dataset. Although the offline datasets contain data experienced by a DQN agent improving over time as training progresses, we compared the performance of offline agents against the best performing online DQN agent obtained after training (i.e., a fully-trained DQN). For each game, we evaluated the 5 offline agents trained (one per dataset), using online returns, reporting the best averaged performance.

Offline DQN underperforms fully-trained online DQN on all except a few games, where it achieves higher scores with the same amount of data. Offline QR-DQN, on the other hand, outperforms offline DQN and fully-trained DQN on most of the games. These results demonstrate that it is possible to optimize strong agents offline using standard deep RL algorithms. Furthermore, the disparity between the performance of offline QR-DQN and DQN indicates the difference in their ability to exploit offline data.
Offline DQN. Normalized improvement over a fully-trained DQN, per game, of offline DQN trained using DQN replay. On the normalized scale, fully-trained DQN corresponds to 100% performance while random agent corresponds to 0%.
Offline QR-DQN. Normalized performance improvement (in %) over a fully-trained DQN agent, per game, of offline QR-DQN trained offline using DQN replay.
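The normalized scale used in these plots can be written as a one-line function. This is a small sketch based on the description above (random agent at 0%, fully-trained DQN at 100%); the function and argument names are ours, chosen for illustration.

```python
def normalized_score(agent_score, random_score, dqn_score):
    """Rescale a raw game score so that a random agent maps to 0%
    and a fully-trained DQN agent maps to 100%."""
    return 100.0 * (agent_score - random_score) / (dqn_score - random_score)
```

For example, an offline agent scoring 150 on a game where the random agent scores 50 and fully-trained DQN scores 100 would sit at 200% on this scale.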
Introducing Two Robust Offline RL Agents
In online RL, an agent chooses actions that it thinks will lead to high rewards, and then receives corrective feedback. Since it is not possible to collect additional data in offline RL, it is essential to reason about generalization using a fixed dataset. Leveraging methods from supervised learning that use an ensemble of models to improve generalization, we present two new offline RL agents:
  • Ensemble-DQN is a simple extension of DQN that trains multiple Q-value estimates and averages them for evaluation.
  • Random Ensemble Mixture (REM) is an easy to implement extension of DQN inspired by Dropout. The key intuition behind REM is that if one has access to multiple estimates of Q-values, then a weighted combination of the Q-value estimates is also an estimate for Q-values. Accordingly, in each training step, REM randomly combines multiple Q-value estimates and uses this random combination for robust training.
Neural Network architectures for DQN, distributional QR-DQN and the expected RL variants with the same multi-head QR-DQN architecture, i.e., Ensemble-DQN and REM. In QR-DQN, each head (red rectangles) corresponds to a specific fraction of the return distribution, while in the proposed variants, each head approximates the Q-function.
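The difference between the two agents comes down to how the heads are combined. The sketch below shows both combinations on plain Python lists. It assumes each head outputs one Q-value per action, and it samples the mixture by normalizing uniform random draws, which is one simple way to obtain weights that are non-negative and sum to 1. All function names are hypothetical, and the network and loss are omitted.

```python
import random


def sample_mixture_weights(num_heads, rng):
    """Draw random non-negative weights that sum to 1."""
    # 1.0 - rng.random() lies in (0, 1], so the total is never zero.
    raw = [1.0 - rng.random() for _ in range(num_heads)]
    total = sum(raw)
    return [w / total for w in raw]


def mix_heads(head_q_values, weights):
    """Weighted combination of per-head Q-values, indexed [head][action]."""
    num_actions = len(head_q_values[0])
    return [
        sum(w * head[a] for w, head in zip(weights, head_q_values))
        for a in range(num_actions)
    ]


def ensemble_q_values(head_q_values):
    """Ensemble-DQN style combination: a plain average over heads."""
    num_heads = len(head_q_values)
    return mix_heads(head_q_values, [1.0 / num_heads] * num_heads)


def rem_q_values(head_q_values, rng):
    """REM style combination: a fresh random mixture for each training step."""
    weights = sample_mixture_weights(len(head_q_values), rng)
    return mix_heads(head_q_values, weights)
```

In a full agent, the mixed Q-values would feed the usual temporal-difference loss; that part is left out here to keep the focus on the mixing step.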
To utilize the DQN Replay Dataset more efficiently, we train offline agents for five times as many training iterations as online DQN and report their performance below. Offline REM outperforms offline DQN and offline QR-DQN. The comparison with fully-trained online C51, a strong distributional agent, illustrates that the gains from offline REM exceed those from C51.
Offline REM vs. baselines. Median normalized scores averaged over 5 runs across 60 Atari games of offline agents trained using DQN replay for 5x iterations, compared to online DQN.
Under the standard Atari training protocols, online REM performs on par with QR-DQN in the online RL setting. This suggests that we can use the insights gained from the DQN Replay Dataset and the offline RL setting to build effective online RL methods.
Online REM vs. baselines. Median normalized evaluation scores averaged over 5 runs (shown as traces) across 60 stochastic Atari 2600 games of online agents trained for 200 million frames. Online REM with 4 Q-networks performs comparably to online QR-DQN.
Comparison of Results: Important Factors in Offline RL
The discrepancy between these results and prior work that reports failure of standard RL agents in the offline setting could be attributed to the following factors:
  • Offline Dataset Size: We trained offline QR-DQN and REM with reduced data obtained via randomly subsampling the entire DQN Replay Dataset, maintaining the same data distribution. Analogous to supervised learning, performance tends to increase as the size of data increases. With only 10% of the entire dataset, REM and QR-DQN approximately recover the performance of fully-trained DQN.
  • Offline Dataset Composition: We trained offline RL agents on the first 20 million frames per game in the DQN Replay Dataset. Offline REM and QR-DQN outperform the best policy in this lower quality dataset, indicating that standard RL agents work well in the offline setting with sufficiently diverse datasets.
    Offline RL with Lower Quality Dataset. REM and QR-DQN trained on offline data collected from DQN trained for 20 iterations (by using the first 20M frames from each game replay dataset). The horizontal line shows the performance of best policy in this dataset, which is significantly worse than fully-trained DQN.
  • Offline Algorithm Choice: There are claims that standard off-policy agents are ineffective on continuous control tasks when trained offline. However, we found that recent continuous control agents, such as TD3, perform comparably to a sophisticated offline agent when trained on large and diverse offline datasets.
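The two dataset manipulations described in the size and composition experiments above can be sketched as follows. Here the dataset is assumed to be a Python list of transitions in the order they were logged; the function names are hypothetical, and the actual replay data is stored in a different on-disk format.

```python
import random


def subsample_uniformly(dataset, fraction, seed=0):
    """Keep a random fraction of the transitions.

    Uniform sampling without replacement preserves the data
    distribution in expectation, only the size changes.
    """
    rng = random.Random(seed)
    keep = int(len(dataset) * fraction)
    return rng.sample(dataset, keep)


def first_k_transitions(dataset, k):
    """Keep only the earliest k transitions, i.e. data from the
    early, lower-quality policies seen at the start of training."""
    return dataset[:k]
```

The first function changes how much data the offline agent sees, while the second changes which policies generated that data.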
Future Work. Our results emphasize the need for a rigorous characterization of the role of generalization due to neural networks when learning from offline data collected from a large mixture of diverse policies. Benchmarking offline RL with various data collection strategies by subsampling DQN replay (e.g., first / last k million frames) is another important direction. We currently employ online policy evaluation; however, “true” offline RL requires offline policy evaluation for hyperparameter tuning and early stopping. Finally, model-based RL and self-supervised learning approaches are also promising for offline RL.

This research was conducted in collaboration with Dale Schuurmans. We’d like to thank members of the Google Research, Brain Team for valuable discussions. A prior version of this work was presented as a contributed talk at NeurIPS 2019 DRL Workshop.
