Reinforcement learning systems can make decisions in one of two ways. In the model-based approach, a system uses a predictive model of the world to ask questions of the form “what will happen if I do x?” and chooses the best x.^{1} In the alternative model-free approach, the modeling step is bypassed altogether in favor of learning a control policy directly. Although in practice the line between these two techniques can become blurred, the distinction serves as a useful coarse guide for dividing up the space of algorithmic possibilities.
Predictive models can be used to ask “what if?” questions to guide future decisions.
The natural question to ask after making this distinction is whether to use such a predictive model. The field has grappled with this question for quite a while, and is unlikely to reach a consensus any time soon. However, we have learned enough about designing model-based algorithms that it is possible to draw some general conclusions about best practices and common pitfalls. In this post, we will survey various realizations of model-based reinforcement learning methods. We will then describe some of the tradeoffs that come into play when using a learned predictive model for training a policy and how these considerations motivate a simple but effective strategy for model-based reinforcement learning. The latter half of this post is based on our recent paper on model-based policy optimization, for which code is available here.
Model-based techniques
Below, model-based algorithms are grouped into four categories to highlight the range of uses of predictive models. For the comparative performance of some of these approaches in a continuous control setting, this benchmarking paper is highly recommended.
Analytic gradient computation
Assumptions about the form of the dynamics and cost function are convenient because they can yield closed-form solutions for locally optimal control, as in the LQR framework. Even when these assumptions are not valid, receding-horizon control can account for small errors introduced by approximated dynamics. Similarly, dynamics models parametrized as Gaussian processes have analytic gradients that can be used for policy improvement. Controllers derived via these simple parametrizations can also be used to provide guiding samples for training more complex nonlinear policies.
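To make the closed-form structure concrete, here is a minimal sketch (not from the post) of the finite-horizon discrete-time LQR backward Riccati recursion, applied to a toy double-integrator system:

```python
import numpy as np

# Finite-horizon discrete-time LQR via backward Riccati recursion.
# Dynamics x' = A x + B u, stage cost x^T Q x + u^T R u.
def lqr_gains(A, B, Q, R, horizon):
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        # Optimal feedback: K = (R + B^T P B)^{-1} B^T P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # gains[t] is the feedback matrix for timestep t

# Toy double integrator (illustrative choice, not from the post).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
gains = lqr_gains(A, B, Q, R, horizon=50)

# Closed-loop control u = -K x drives the state toward the origin.
x = np.array([[1.0], [0.0]])
for K in gains:
    x = A @ x - B @ (K @ x)
```

The key point is that, under linear dynamics and quadratic cost, the optimal controller is computed exactly by this recursion, with no sampling required.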
Sampling-based planning
In the fully general case of nonlinear dynamics models, we lose guarantees of local optimality and must resort to sampling action sequences. The simplest version of this approach, random shooting, entails sampling candidate actions from a fixed distribution, evaluating them under a model, and choosing the action that is deemed the most promising. More sophisticated variants iteratively adjust the sampling distribution, as in the cross-entropy method (CEM; used in PlaNet, PETS, and visual foresight) or path integral optimal control (used in recent model-based dexterous manipulation work).
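As a concrete illustration, here is a hypothetical sketch of random shooting used as model-predictive control on a toy one-dimensional problem; the dynamics, reward, and all names are illustrative stand-ins, not from the post:

```python
import numpy as np

# Random-shooting MPC: sample candidate action sequences from a fixed
# distribution, evaluate each under the model, and execute only the
# first action of the best sequence before replanning.
def rollout_return(model, reward_fn, state, actions):
    total = 0.0
    for a in actions:
        state = model(state, a)
        total += reward_fn(state, a)
    return total

def random_shooting(model, reward_fn, state, horizon=10, n_candidates=500):
    rng = np.random.default_rng(0)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    returns = [rollout_return(model, reward_fn, state, seq) for seq in candidates]
    return candidates[int(np.argmax(returns))][0]  # MPC: first action only

model = lambda s, a: s + 0.1 * a   # toy dynamics standing in for a learned model
reward_fn = lambda s, a: -s**2     # reward for driving the state to zero

# MPC loop: replan from every new state, executing one action at a time.
s = 1.0
for _ in range(30):
    s = model(s, random_shooting(model, reward_fn, s))
```

CEM follows the same skeleton but refits the sampling distribution to the elite candidates each iteration instead of sampling from a fixed one.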
In discrete-action settings, however, it is more common to search over tree structures than to iteratively refine a single trajectory of waypoints. Common tree-based search algorithms include MCTS, which has underpinned recent impressive results in game playing, and iterated width search. Sampling-based planning, in both continuous and discrete domains, can also be combined with structured physics-based, object-centric priors.
Model-based data generation
An important detail in many machine learning success stories is a means of artificially increasing the size of a training set. It is difficult to define a manual data augmentation procedure for policy optimization, but we can view a predictive model analogously as a learned method of generating synthetic data. The original proposal of such a combination comes from the Dyna algorithm by Sutton, which alternates between model learning, data generation under a model, and policy learning using the model data. This strategy has been combined with iLQG, model ensembles, and meta-learning; has been scaled to image observations; and is amenable to theoretical analysis. A close cousin to model-based data generation is the use of a model to improve target value estimates for temporal difference learning.
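The Dyna alternation described above can be sketched in a few lines; the tabular Dyna-Q toy below (the 5-state chain environment is our own illustrative invention) shows the loop of real experience, model learning, and planning on model-generated transitions:

```python
import random

# Minimal tabular Dyna-Q sketch; the 5-state chain environment is a toy.
def dyna_q(n_episodes=30, n_planning=20, alpha=0.5, gamma=0.95, eps=0.3, seed=0):
    rng = random.Random(seed)
    n_states, actions = 5, [0, 1]   # action 1 moves right, 0 moves left
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    model = {}                       # learned model: (s, a) -> (r, s')

    def step(s, a):                  # true environment: reward 1 at the right end
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        return (1.0 if s2 == n_states - 1 else 0.0), s2

    for _ in range(n_episodes):
        s = 0
        while s != n_states - 1:
            a = rng.choice(actions) if rng.random() < eps else max(actions, key=lambda b: Q[(s, b)])
            r, s2 = step(s, a)
            # (1) direct RL update from real experience
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
            # (2) model learning: remember the observed transition
            model[(s, a)] = (r, s2)
            # (3) planning: extra updates on model-generated transitions
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in actions) - Q[(ps, pa)])
            s = s2
    return Q

Q = dyna_q()
```

The planning loop in step (3) is where "synthetic data" pays off: each real transition is reused many times, so far fewer environment steps are needed to propagate value information.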
Value-equivalence prediction
A final technique, which does not fit neatly into the model-based versus model-free categorization, is to incorporate computation that resembles model-based planning without supervising the model’s predictions to resemble actual states. Instead, plans under the model are constrained to match trajectories in the real environment only in their predicted cumulative reward. These value-equivalent models have been shown to be effective in high-dimensional observation spaces where conventional model-based planning has proven difficult.
Tradeoffs of model data
In what follows, we will focus on the data generation strategy for model-based reinforcement learning. It is not obvious whether incorporating model-generated data into an otherwise model-free algorithm is a good idea. Modeling errors could cause diverging temporal-difference updates, and in the case of linear approximation, model and value fitting are equivalent. However, it is easier to motivate model usage by considering the empirical generalization capacity of predictive models, and such a model-based augmentation procedure turns out to be surprisingly effective in practice.
The Good News
A natural way of thinking about the effects of model-generated data begins with the standard objective of reinforcement learning:

$$\max_\pi \; \mathbb{E}_{\pi,\, p} \left[ \sum_t \gamma^t r(s_t, a_t) \right]$$

which says that we want to maximize the expected cumulative discounted rewards $r(s_t, a_t)$ from acting according to a policy $\pi$ in an environment governed by dynamics $p$. It is important to pay particular attention to the distributions over which this expectation is taken.^{2} For example, while the expectation is supposed to be taken over trajectories from the current policy $\pi$, in practice many algorithms reuse trajectories from an old policy $\pi_\text{old}$ for improved sample efficiency. There has been much algorithm development dedicated to correcting for the issues associated with the resulting off-policy error.
Using model-generated data can also be viewed as a simple modification of the sampling distribution. Incorporating model data into policy optimization amounts to swapping out the true dynamics $p$ with an approximation $\hat{p}$. The model bias introduced by making this substitution acts analogously to the off-policy error, but it allows us to do something rather useful: we can query the model dynamics $\hat{p}$ at any state to generate samples from the current policy, effectively circumventing the off-policy error.
If model usage can be viewed as trading between off-policy error and model bias, then a straightforward way to proceed would be to compare these two terms. However, estimating a model’s error on the current policy’s distribution requires us to make a statement about how that model will generalize. While worst-case bounds are rather pessimistic here, we found that predictive models tend to generalize to the state distributions of future policies well enough to motivate their usage in policy optimization.
Generalization of learned models, trained on samples from a data-collecting policy $\pi_D$, to the state distributions of future policies $\pi$ seen during policy optimization. Increasing the training set size not only improves performance on the training distribution, but also on nearby distributions.
The Bad News
The above result suggests that the single-step predictive accuracy of a learned model can be reliable under policy shift. The catch is that most model-based algorithms rely on models for much more than single-step accuracy, often performing model-based rollouts equal in length to the task horizon in order to properly estimate the state distribution under the model. When predictions are strung together in this manner, small errors compound over the prediction horizon.
A 450-step action sequence rolled out under a learned probabilistic model, with the figure’s position depicting the mean prediction and the shaded regions corresponding to one standard deviation away from the mean. The growing uncertainty and deterioration of a recognizable sinusoidal motion underscore the accumulation of model errors.
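A toy numerical sketch (not from the paper) makes the compounding explicit: even a 1% one-step model error grows multiplicatively with the rollout length:

```python
# A 1% one-step model error compounding over a k-step rollout.
def true_step(x):
    return 1.05 * x            # toy "true" dynamics

def model_step(x):
    return 1.05 * 1.01 * x     # learned model with a 1% one-step error

def rollout_gap(k, x0=1.0):
    xt = xm = x0
    for _ in range(k):
        xt, xm = true_step(xt), model_step(xm)
    return abs(xt - xm) / abs(xt)   # relative error after k model steps

gaps = [rollout_gap(k) for k in (1, 10, 100)]
```

Here the relative error after $k$ steps is $(1.01)^k - 1$: about 1% after one step, 10% after ten, and over 170% after one hundred, which is why task-horizon-length rollouts are so unforgiving.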
Analyzing the tradeoff
This qualitative tradeoff can be made more precise by writing a lower bound on a policy’s true return in terms of its model-estimated return:

$$\eta[\pi] \;\geq\; \hat{\eta}[\pi] - C(\epsilon_m, \epsilon_\pi, k)$$

A lower bound on a policy’s true return in terms of its expected model return, the model rollout length, the policy divergence, and the model error on the current policy’s state distribution.
As expected, there is a tension involving the model rollout length. The model serves to reduce off-policy error via the terms exponentially decreasing in the rollout length $k$. However, increasing the rollout length also brings about increased discrepancy proportional to the model error.
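This tension can be visualized with a schematic version of the discrepancy term; the functional form and all constants below are simplified and chosen purely for illustration, not taken from the paper:

```python
# Schematic discrepancy C(eps_m, eps_pi, k): an off-policy term that decays
# exponentially in the rollout length k, plus a model-error term that grows
# linearly in k. All constants are illustrative.
def discrepancy(k, gamma=0.99, eps_pi=0.1, eps_m=0.05):
    off_policy = gamma**k * eps_pi / (1 - gamma)**2   # shrinks as rollouts lengthen
    model_error = k * eps_m / (1 - gamma)             # grows as model errors compound
    return off_policy + model_error

# The bound is tightest at an intermediate rollout length.
k_star = min(range(1, 201), key=discrepancy)
```

With these illustrative constants, the minimizing rollout length is neither 1 nor the full horizon, which is the qualitative behavior the bound predicts.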
Model-based policy optimization
We have two main conclusions from the above results:
1. predictive models can generalize well enough for the incurred model bias to be worth the reduction in off-policy error, but
2. compounding errors make long-horizon model rollouts unreliable.
A simple recipe for combining these two insights is to use the model only to perform short rollouts from all previously encountered real states instead of full-length rollouts from the initial state distribution. Variants of this procedure have been studied in prior works dating back to the classic Dyna algorithm, and we will refer to it generically as model-based policy optimization (MBPO), which we summarize in the pseudocode below.
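A structural sketch of such a loop, with toy stand-ins for the model, policy, and environment (in MBPO proper these are a probabilistic model ensemble, a SAC policy, and a real environment):

```python
import random

rng = random.Random(0)

def env_step(s, a):
    return s + a                         # stand-in environment
def model_step(s, a):
    return s + a + rng.gauss(0.0, 0.01)  # stand-in learned model p-hat
def policy(s):
    return rng.uniform(-1.0, 1.0)        # stand-in current policy pi

def mbpo(n_epochs=3, env_steps=10, n_rollouts=20, k=5):
    d_env, d_model = [], []
    s = 0.0
    for _ in range(n_epochs):
        # (1) train the model on d_env (omitted for the toy stand-in)
        # (2) collect real experience with the current policy
        for _ in range(env_steps):
            a = policy(s)
            s2 = env_step(s, a)
            d_env.append((s, a, s2))
            s = s2
        # (3) short k-step model rollouts branched from previously seen real states
        for _ in range(n_rollouts):
            ms = rng.choice(d_env)[0]
            for _ in range(k):
                a = policy(ms)
                ms2 = model_step(ms, a)
                d_model.append((ms, a, ms2))
                ms = ms2
        # (4) policy improvement on d_model (omitted; SAC in the paper)
    return d_env, d_model

d_env, d_model = mbpo()
```

The structural point is in step (3): rollouts are short (length k, not the task horizon) and branch from real states already in the replay buffer, which keeps compounding model error in check while still multiplying the amount of training data.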
We found that this simple procedure, combined with a few important design decisions like using probabilistic model ensembles and a stable off-policy model-free optimizer, yields the best combination of sample efficiency and asymptotic performance. We also found that MBPO avoids the pitfalls that have prevented recent model-based methods from scaling to higher-dimensional states and long-horizon tasks.
Learning curves of MBPO and five prior works on continuous control benchmarks. MBPO reaches the same asymptotic performance as the best model-free algorithms, often with only one-tenth of the data, and scales to state dimensions and horizon lengths that cause previous model-based algorithms to fail.
This post is based on the following paper:

When to Trust Your Model: Model-Based Policy Optimization
Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine
Neural Information Processing Systems (NeurIPS), 2019.
Open-source code
I would like to thank Michael Chang and Sergey Levine for their valuable feedback.

In reinforcement learning, this variable is typically denoted by a for “action.” In control theory, it is denoted by u for “upravleniye” (or more faithfully, “управление”), which I am told is “control” in Russian.↩

We have omitted the initial state distribution $s_0 \sim \rho(\cdot)$ to focus on those distributions affected by incorporating a learned model.↩
References
 KR Allen, KA Smith, and JB Tenenbaum. The tools challenge: rapid trial-and-error learning in physical problem solving. CogSci 2019.
 B Amos, IDJ Rodriguez, J Sacks, B Boots, and JZ Kolter. Differentiable MPC for end-to-end planning and control. NeurIPS 2018.
 T Anthony, Z Tian, and D Barber. Thinking fast and slow with deep learning and tree search. NIPS 2017.
 K Asadi, D Misra, S Kim, and ML Littman. Combating the compounding-error problem with a multi-step model. arXiv 2019.
 V Bapst, A Sanchez-Gonzalez, C Doersch, KL Stachenfeld, P Kohli, PW Battaglia, and JB Hamrick. Structured agents for physical construction. ICML 2019.
 ZI Botev, DP Kroese, RY Rubinstein, and P L’Ecuyer. The cross-entropy method for optimization. Handbook of Statistics, volume 31, chapter 3. 2013.
 J Buckman, D Hafner, G Tucker, E Brevdo, and H Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. NeurIPS 2018.
 K Chua, R Calandra, R McAllister, and S Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. NeurIPS 2018.
 I Clavera, J Rothfuss, J Schulman, Y Fujita, T Asfour, and P Abbeel. Model-based reinforcement learning via meta-policy optimization. CoRL 2018.
 R Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. CG 2006.
 M Deisenroth and CE Rasmussen. PILCO: A model-based and data-efficient approach to policy search. ICML 2011.
 F Ebert, C Finn, S Dasari, A Xie, A Lee, and S Levine. Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv 2018.
 V Feinberg, A Wan, I Stoica, MI Jordan, JE Gonzalez, and S Levine. Model-based value estimation for efficient model-free reinforcement learning. ICML 2018.
 C Finn and S Levine. Deep visual foresight for planning robot motion. ICRA 2017.
 S Gu, T Lillicrap, I Sutskever, and S Levine. Continuous deep Q-learning with model-based acceleration. ICML 2016.
 D Ha and J Schmidhuber. World models. NeurIPS 2018.
 T Haarnoja, A Zhou, P Abbeel, and S Levine. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML 2018.
 D Hafner, T Lillicrap, I Fischer, R Villegas, D Ha, H Lee, and J Davidson. Learning latent dynamics for planning from pixels. ICML 2019.
 LP Kaelbling, ML Littman, and AP Moore. Reinforcement learning: a survey. JAIR 1996.
 L Kaiser, M Babaeizadeh, P Milos, B Osinski, RH Campbell, K Czechowski, D Erhan, C Finn, P Kozakowski, S Levine, R Sepassi, G Tucker, and H Michalewski. Model-based reinforcement learning for Atari. arXiv 2019.
 A Krizhevsky, I Sutskever, and GE Hinton. ImageNet classification with deep convolutional neural networks. NIPS 2012.
 T Kurutach, I Clavera, Y Duan, A Tamar, and P Abbeel. Model-ensemble trust-region policy optimization. ICLR 2018.
 S Levine and V Koltun. Guided policy search. ICML 2013.
 W Li and E Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. ICINCO 2004.
 N Lipovetzky, M Ramirez, and H Geffner. Classical planning with simulators: results on the Atari video games. IJCAI 2015.
 Y Luo, H Xu, Y Li, Y Tian, T Darrell, and T Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. ICLR 2019.
 R Munos, T Stepleton, A Harutyunyan, and MG Bellemare. Safe and efficient off-policy reinforcement learning. NIPS 2016.
 A Nagabandi, K Konolige, S Levine, and V Kumar. Deep dynamics models for learning dexterous manipulation. arXiv 2019.
 A Nagabandi, GS Kahn, R Fearing, and S Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. ICRA 2018.
 J Oh, S Singh, and H Lee. Value prediction network. NIPS 2017.
 R Parr, L Li, G Taylor, C Painter-Wakefield, and ML Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. ICML 2008.
 D Precup, R Sutton, and S Singh. Eligibility traces for off-policy policy evaluation. ICML 2000.
 J Schrittwieser, I Antonoglou, T Hubert, K Simonyan, L Sifre, S Schmitt, A Guez, E Lockhart, D Hassabis, T Graepel, T Lillicrap, and D Silver. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv 2019.
 D Silver, T Hubert, J Schrittwieser, I Antonoglou, M Lai, A Guez, M Lanctot, L Sifre, D Kumaran, T Graepel, TP Lillicrap, K Simonyan, and D Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv 2017.
 RS Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. ICML 1990.
 E Talvitie. Self-correcting models for model-based reinforcement learning. AAAI 2016.
 A Tamar, Y Wu, G Thomas, S Levine, and P Abbeel. [Value iteration networks](https://arxiv.org/abs/1602.02867). NIPS 2016.
 Y Tassa, T Erez, and E Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. IROS 2012.
 H van Hasselt, M Hessel, and J Aslanides. When to use parametric models in reinforcement learning? NeurIPS 2019.
 R Veerapaneni, JD Co-Reyes, M Chang, M Janner, C Finn, J Wu, JB Tenenbaum, and S Levine. Entity abstraction in visual model-based reinforcement learning. CoRL 2019.
 T Wang, X Bao, I Clavera, J Hoang, Y Wen, E Langlois, S Zhang, G Zhang, P Abbeel, and J Ba. Benchmarking model-based reinforcement learning. arXiv 2019.
 M Watter, JT Springenberg, J Boedecker, and M Riedmiller. Embed to control: a locally linear latent dynamics model for control from raw images. NIPS 2015.
 G Williams, A Aldrich, and E Theodorou. Model predictive path integral control using covariance variable importance sampling. arXiv 2015.