My Reinforcement Learning Learnings

I spent a good chunk of my time over the last two years applying deep reinforcement learning techniques to create an AI that can play the CodeCraft real-time strategy game. My primary motivation was to learn how to tackle nontrivial problems with machine learning and become proficient with modern auto-differentiation frameworks. Thousands of experiment runs and hundreds of commits later, I still have much to learn, but I like to think that I have picked up a trick or two. This blog post gives an overview of the workflows and intuitions I adopted over the course of working on CodeCraft in the hope that they will prove useful to anyone else looking to pursue similar work. For a very different take on the same material that provides motivating examples for many of the ideas summarized here, check out my dark fantasy machine learning poem “Conjuring a CodeCraft Mind”.


Courtesy of xkcd

Core Loop

At a high level, the core loop for applying machine learning is straightforward: First, we form ideas about how we can improve results. We implement the idea that we think will have the highest return on investment. We run experiments to test that our implementation actually does what we think it does and to determine whether it results in any improvements. And analyzing the outcomes of the experiments often gives us new ideas for further improvements and allows us to keep iterating. In the remainder of this blog post, I will discuss all these steps in more detail and describe some of the tooling we can use to speed up the loop by making each step as effective as possible.


Most of my experience prior to pivoting to machine learning has been in distributed systems and other areas of software engineering. I found that many core software engineering skills carry over quite well to developing machine learning systems, but there are some key differences that pose novel challenges. Most of these derive from the fact that it is extremely difficult to find defects in machine learning code. At least in theory, conventional software is composed of well-defined modules that can be tested in isolation and most bugs cause obvious failures that can be systematically debugged by successively narrowing down the faulty subsystem. In contrast, the heart of machine learning code (aka software 2.0​1​) consists of a big heap of largely uninterpretable floating-point numbers which we send through a series of numerical transformations (that at best have an approximate mathematical justification relying on partially-met assumptions) in the hope of turning them into slightly improved floating-point numbers that perform better on some fuzzy objective.

Many bugs will not result in a program crash or even NaN values but just cause a subtle performance degradation that easily escapes detection. In some cases, these can be relatively easy to spot. They might also cause fairly specific performance degradations that give hints as to their location. For example, one of my policies was trained to control a game unit to collect resources and build new units, which worked very well until it started building new units, at which point its movement became completely random. This turned out to be because the features for the positions of units got randomly shuffled on each timestep. But in other cases, even when you know that there is a bug, it can be exceedingly difficult to track down because it could originate from a subtle error in any part of the code which is then silently propagated. One severe performance degradation that I spent over a month hunting down turned out to be caused by the accidental inclusion of a single unnormalized input feature that would sometimes take on too large a value and destabilize training in a way that could have been caused by a myriad of other mathematical errors anywhere in the code. (Just to give a few additional examples of bugs, and here’s some more, and oh my god so many bugs, bugs everywhere stahp it, STAHP IT!) Sometimes you can reveal a specific deficiency that implicates a particular region of code through carefully designed experiments, but often the best you can do is to carefully read through all the code dozens of times until you spot the mistake. Where is your God now, test-driven developers!

Running Experiments

The difficulty of debugging code is compounded by the fact that generally you won’t know a priori whether a change will result in a meaningful improvement. Most don’t, but all of them have a chance of introducing new bugs. To prevent our code from degrading over time and retain any chance of making progress, we need to run frequent experiments that clearly characterize the performance of our machine learning system. In an ideal world, we would run extensive experiments after every single change to determine its effect with high statistical significance. Depending on your task this might be prohibitively expensive, and particularly in reinforcement learning (RL), reaching very high levels of statistical significance is often not feasible due to the high variance between runs. In this case, the pragmatic approach is to run extensive evaluations only periodically, come up with simplified versions of the target task that allow changes to be derisked more cheaply, and bias towards simplicity, only keeping additions that show a large benefit. In this regime, it is very easy for selection bias to creep into your baselines. To mitigate this, it is useful to be able to rerun older versions of the code to determine whether an apparent performance degradation was caused by a regression, or whether the baseline run just got lucky and overestimated performance.
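When run-to-run variance is high, even a crude statistical check over repeated runs helps decide whether a candidate change plausibly regressed performance or the baseline just got lucky. Here is a minimal sketch of such a check (this is not code from the project, and the per-run evaluation scores are made up) using a permutation test:

```python
import random
import statistics

def permutation_test(baseline, candidate, n_resamples=10_000, seed=0):
    """Estimate how likely the observed gap between two sets of run scores
    would be if both sets actually came from the same distribution."""
    rng = random.Random(seed)
    observed = statistics.mean(candidate) - statistics.mean(baseline)
    pooled = list(baseline) + list(candidate)
    n = len(candidate)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        resampled = statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])
        if abs(resampled) >= abs(observed):
            hits += 1
    return hits / n_resamples  # two-sided p-value estimate

# Hypothetical final eval scores from 5 baseline runs and 5 candidate runs:
p = permutation_test(
    baseline=[0.71, 0.68, 0.74, 0.70, 0.69],
    candidate=[0.62, 0.60, 0.65, 0.61, 0.64],
)
```

A small p-value suggests the gap is a real regression rather than baseline luck; with only a handful of runs per side, only large gaps will clear that bar, which is exactly why selection bias creeps in so easily.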

Progress on machine learning projects is often bottlenecked not by writing code but by running experiments to test different hypotheses. It is prudent to invest in a bit of tooling to make this part of the loop as frictionless as possible and allow you to run many experiments in parallel. At the start of my project to create an AI for CodeCraft, I wrote a little job runner which is neither particularly good nor novel but serves as a good example for what to look for in your own tooling (see also fastgpu which follows a similar philosophy and is actually maintained). My job runner consists of a short Python script that watches a directory for new “job files” that specify a git revision and a set of hyperparameters. Jobs are placed in a queue and run once a GPU becomes available. To run a job, the script creates a new temporary folder, clones my git repository at the specified revision, and starts a subprocess within that folder that executes the main training script. To schedule jobs, I just run python --params-file=params.yaml, which determines the current git revision, reads the hyperparameters in params.yaml, and creates a corresponding file in the job file directory (which may be on a remote machine accessible via ssh). The params.yaml file contains a list of job configs and typically looks something like this:

- hpset: standard
  adr_variety: [0.5, 0.3]
  lr: [0.001, 0.0003]
- hpset: standard
  repeat: 4
  steps: 300e6

The hpset: standard tells my training script to use a specific set of default hyperparameters. The special repeat parameter tells the job runner to spawn multiple identical runs for an experiment. When a hyperparameter is set to a list of different values, one experiment is spawned for each combination. So the params.yaml above will spawn a total of 8 experiment runs: 4 identical runs for 300 million steps with the default set of hyperparameters, and one run for each of the 4 combinations of the adr_variety and lr hyperparameters.
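The expansion logic is simple enough to sketch in a few lines of Python (a hypothetical reimplementation for illustration, not the actual job runner code):

```python
import itertools

def expand_jobs(job_specs):
    """Turn job specs (as parsed from params.yaml) into individual run configs.
    List-valued hyperparameters are expanded into their cartesian product, and
    the special `repeat` key spawns several identical runs."""
    runs = []
    for spec in job_specs:
        spec = dict(spec)  # don't mutate the caller's config
        repeat = spec.pop("repeat", 1)
        sweep_keys = [k for k, v in spec.items() if isinstance(v, list)]
        for combo in itertools.product(*(spec[k] for k in sweep_keys)):
            run = {**spec, **dict(zip(sweep_keys, combo))}
            runs.extend(dict(run) for _ in range(repeat))
    return runs

# The params.yaml example from above, after YAML parsing:
jobs = expand_jobs([
    {"hpset": "standard", "adr_variety": [0.5, 0.3], "lr": [0.001, 0.0003]},
    {"hpset": "standard", "repeat": 4, "steps": 300e6},
])
# 4 sweep combinations + 4 identical repeats = 8 runs total
```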

Collecting Results

When you are running many experiments, just keeping track of all the results becomes a challenge. My solution of choice is currently Weights & Biases (W&B) which allows you to record metrics during your experiment runs and then view them on W&B’s website (also see this long thread on Reddit for alternatives). All metrics are searchable on the run page and you can create custom dashboards that display all your key metrics. Here’s what this looks like for one of my CodeCraft runs:

When initializing the W&B client at the start of an experiment, you pass in a “config” dictionary with all the hyperparameters and other information relevant to the run. If you ever want to check some specific value on an older run, you can search for it on the run’s info page:

The project page works similarly to the run page and also allows you to search for and compare different (sets of) runs. I usually start several identical runs per experiment, and my code automatically constructs a long descriptor string from the git revision and hyperparameter config used by a run, which I group by on the run page to compare different experiments:

Another feature I find useful is the “run comparison” widget that allows you to quickly determine any differences between runs:


Despite what it looks like, this blog post has not actually been sponsored by W&B; I just think it’s neat.

Generating Hypotheses

The previous sections have given an overview of the mechanics of working on a machine learning project, but how do you actually come up with good ideas that allow you to make progress? For more mature applications like image classification, there are fairly well-established procedures for getting good results, but once you are moving closer to the research frontier and trying to solve tasks that are off the beaten path, things get trickier. There are myriads of different tricks and techniques scattered across papers and blog posts, but until you’ve accumulated some experience it can be difficult to know what to look for and prioritize. In this section, I will go over all of the key components of reinforcement learning systems at a high level and try to give a sense of how much they mattered in my work on CodeCraft.

Optimizer and RL Algorithm

At the foundation, we find the optimizer and RL algorithm. Finding robust improvements to these is quite rare and not something I can claim expertise on. When the goal is to solve a novel task, it’s best to stick with an existing method and, if at all possible, reuse an existing proven implementation since this kind of code is quite tricky to implement and debug (doing so is a good learning experience though). When working on CodeCraft, I just picked PPO and stuck with it since it’s simple and I figured that if it’s good enough for Dota 2​2​, it’ll suffice for CodeCraft.

Network Architecture

Still fairly general but somewhat more task-specific is the network architecture. As with new optimizers or RL algorithms, coming up with entirely new kinds of networks that improve upon the state of the art is quite rare, but often you can get benefits from composing existing pieces together in a way that works particularly well for the task at hand.

One particularly reliable way of using domain knowledge to obtain large gains is to encode task-specific invariants into the network architecture by sharing weights or canonicalizing input features. Two examples: In CodeCraft, agents observe multiple different objects, but all of them share features for the x and y positions of the object. It stands to reason that the embeddings computed for positional features shouldn’t differ that much between e.g. allied drones and mineral crystals, and that learning them once for all objects works better than learning them 4 times for all the different objects. And indeed, sharing a single subnetwork between all object types for processing position features results in a nice performance boost. Relatedly, we would expect that if we have a harvester game unit at position (0, 500) that decides to move towards a mineral crystal at position (500, 500), then it will still make that same choice if instead it is placed at position (0, -500) and the mineral crystal at position (500, -500). But our neural network has no way of knowing that it should behave identically in these two situations and may in fact never generalize to positions that it has not seen during training. By simply making the position features of objects relative to the game unit, in which case the position of the mineral crystal would be (500, 0) in both instances, the two equivalent situations look identical to the network and we get generalization completely for free. For CodeCraft, even the much smaller change of rotating all objects to align them with the current movement direction of each game unit results in dramatic sample efficiency improvements.
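This kind of canonicalization can be sketched in a few lines (the function name and the convention of passing the unit’s heading as an angle in radians are my own, not the CodeCraft implementation):

```python
import math

def canonicalize(unit_pos, unit_heading, object_pos):
    """Express an object's position relative to a game unit, rotated so the
    unit's movement direction becomes the positive x-axis. Equivalent
    situations then map to identical input features."""
    dx = object_pos[0] - unit_pos[0]
    dy = object_pos[1] - unit_pos[1]
    cos_h, sin_h = math.cos(unit_heading), math.sin(unit_heading)
    # Rotate the offset by -heading.
    return (dx * cos_h + dy * sin_h, -dx * sin_h + dy * cos_h)

# The two equivalent situations from the text produce identical features:
a = canonicalize((0, 500), 0.0, (500, 500))    # -> (500.0, 0.0)
b = canonicalize((0, -500), 0.0, (500, -500))  # -> (500.0, 0.0)
```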

Further efficiencies can sometimes be obtained by experimenting with other details of the network architecture, though performance seems to be primarily a function of the total number of parameters​3,4​ and many other modifications don’t yield robust improvements​5​. In RL, networks tend to be much smaller than in other domains of machine learning, and increasing their size does not seem to be a reliable way of improving performance. Nonetheless, if the network is too small this can become a bottleneck. For cases where you can’t just reuse an existing architecture, here’s a basic recipe that should work reasonably well more often than not:

  • Start and end the network with a dense layer
  • Use ReLU activation functions
  • Add residual connections if the network is deep
  • Try putting LayerNorm or BatchNorm after every layer except the last one or two layers
  • Use a transformer to process sequences or sets of items
  • Use a transformer or LSTM in the penultimate layer to access information from past timesteps
  • Use either an attention or pooling layer to combine several items into a single item
  • Use a (deep) CNN for images or other features that are naturally represented by a 2D grid
  • When using high-level state (as opposed to images) as input features, your network often doesn’t have to be deep, even a single hidden layer might be sufficient
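To make the recipe concrete, here is a toy forward pass combining several of the points above (dense layers in and out, ReLU activations, LayerNorm, residual connections around each hidden block). All the sizes are made up and this is a NumPy sketch rather than a training-ready network:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    return x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

def layer_norm(x):
    # Normalize each sample's activations to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + 1e-5)

def init(n_in, n_out):
    return rng.normal(0.0, n_in ** -0.5, (n_in, n_out)), np.zeros(n_out)

d_in, d_hidden, d_out, depth = 32, 256, 8, 3
w_in, b_in = init(d_in, d_hidden)
blocks = [init(d_hidden, d_hidden) for _ in range(depth)]
w_out, b_out = init(d_hidden, d_out)

def forward(obs):
    h = layer_norm(relu(dense(obs, w_in, b_in)))
    for w, b in blocks:
        h = h + layer_norm(relu(dense(h, w, b)))  # residual connection
    return dense(h, w_out, b_out)                 # last layer: no norm/activation

logits = forward(rng.normal(size=(4, d_in)))      # batch of 4 observations
```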

For some examples of architectures successfully applied to different RL tasks, see IMPALA​6​ (games, visual input), OpenAI Five​2​, AlphaStar​7​, or my CodeCraft architecture​8​ (real-time strategy), and Solving Rubik’s Cube with a Robot Hand​9​ (low-level motor control).


Input Features

This might be obvious, but the primary consideration for choosing input features is that they give access to all information that is relevant to your agents. Features should also be normalized to zero mean and unit variance. Beyond that, you might find some smaller gains from engineering additional derived features that convey salient information that is not easily computable from the raw inputs. One example from CodeCraft is the “distance to the edge of the map” features which I added to help my policies avoid running into walls, though I’m not sure how much of a difference this actually made. A more clearly successful example from CodeCraft is the inclusion of information about what parts of the map have already been scouted and the last position of enemy drones not currently visible, which led to obvious qualitative changes in behavior and a large increase in performance. Using memory-like features like this is a common way of propagating information from past timesteps that avoids the fickleness of recurrent networks.
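Since good normalization statistics are often not known up front, one common trick is to track them online. A minimal per-feature sketch using Welford’s algorithm (again a generic illustration, not the project’s actual code):

```python
class RunningNormalizer:
    """Track a running mean and variance of a scalar feature (Welford's
    algorithm) and normalize values to roughly zero mean, unit variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / max(self.count - 1, 1)
        return (x - self.mean) / (var ** 0.5 + 1e-8)

norm = RunningNormalizer()
for value in [100.0, 110.0, 90.0, 105.0, 95.0]:
    norm.update(value)
z = norm.normalize(100.0)  # the sample mean is 100, so z is ~0
```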


Environments

Improving the size and quality of your dataset is well known to be a reliable way to improve the performance of deep learning models. When performing reinforcement learning with a simulator that you can modify, you are uniquely positioned to shape the training environments in ways that can yield dramatic improvements to learning speed, robustness, and the set of capabilities that can be attained by your policies. Many of the issues experienced by existing reinforcement learning techniques relate either to their limited exploration capabilities or to overfitting to insufficiently diverse environments. As discussed in more detail in a previous post, aggressively randomizing the environment and increasing its difficulty over time often makes it possible to sidestep these problems and train robust policies that can gradually acquire skills without running into hard exploration problems.
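In code, environment randomization plus automatic difficulty adjustment can be as simple as the following sketch (every parameter name, range, and threshold here is invented for illustration):

```python
import random

def sample_env_params(difficulty, rng=random):
    """Sample randomized environment parameters, widening the ranges with a
    difficulty level in [0, 1] that is increased over the course of training."""
    return {
        "map_size": rng.uniform(1000, 1000 + 5000 * difficulty),
        "enemy_count": rng.randint(1, 1 + round(9 * difficulty)),
        "resource_density": rng.uniform(1.0 - 0.8 * difficulty, 1.0),
    }

def update_difficulty(difficulty, win_rate, target=0.8, step=0.01):
    """Simple automatic curriculum: make the task harder while the policy is
    beating it, and back off when it struggles."""
    if win_rate > target:
        return min(1.0, difficulty + step)
    return max(0.0, difficulty - step)

easy = sample_env_params(0.0)  # at difficulty 0, ranges collapse to the easy end
```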


Hyperparameters

All across the stack, there are hyperparameters to be tuned (CodeCraft has accumulated about a hundred). Despite having worked on the project for a long time, I would be entirely unsurprised by the existence of small tweaks that yield 2x improvements, but for the most part, tuning hyperparameters is not a source of large enduring gains, fundamentally new capabilities, or deep insights. So it’s best not to spend too much time on it. At the same time, RL is notoriously finicky, and if your hyperparameters are completely out of whack, learning performance will be severely degraded. Often the hardest part is to get anything working at all; once you see the first signs of life, homing in on a good configuration is much more straightforward. So a good strategy at the start of a project is to first simplify the task until solving it is trivial and finding good hyperparameters is cheap and easy, and then successively increase complexity, making occasional further adjustments to hyperparameters as necessary.

Briefly, some strategies for the most important hyperparameters:

  • Learning rate: When too low, the policy doesn’t change much and training will be arbitrarily slow, while too large a value causes instabilities that similarly inhibit learning and may be reflected in quickly decreasing entropy or large spikes in metrics such as the gradient norm, KL divergence, PPO clip range or the loss itself. The optimum point tends to be a learning rate that is as large as possible without running into instabilities. Of note, methods such as actor-critic have two largely independent loss terms for the policy and value function and you’ll need to adjust the corresponding coefficients to tune the effective learning rates for both losses. (I got something like a 3x-4x improvement in sample efficiency for free fairly late into my CodeCraft project by increasing the effective contribution of the policy loss #JustMLThings).
  • Another parameter that can be very important is the coefficient for the entropy loss which prevents the policy from quickly collapsing into a local minimum without exploring a wider range of strategies. If this collapse happens, you will see a rapid decline in the policy’s entropy that can be counteracted by increasing the entropy loss until the descent is more gradual. Once policies reach a certain level of performance, they may be limited by the randomness introduced by the entropy loss and plateau, so at some point it can be helpful to construct a schedule that starts with a high entropy loss that is reduced to 0 over the course of training.
  • Sample staleness: This is most often discussed in the context of off-policy methods that use samples collected with a policy different from the one being optimized, but staleness also affects methods that are nominally on-policy. Most on-policy RL methods alternate between an optimization phase that performs gradient descent on the policy, and a rollout phase that collects samples used in the subsequent optimization phase. If you increase the number of samples collected during the rollout phase, you will eventually reach a point of sharply diminishing returns where additional samples collected with the current version of the policy don’t contribute to further improvements. Sample efficiency is usually highest when the policy is updated frequently and rollouts are as short as possible. On the other hand, to maximize GPU efficiency (which grows roughly linearly with batch size up to some limit) you want many parallel environments, and to reduce bias when bootstrapping from value function estimates you want to take many sequential steps. How much sample staleness can be tolerated is highly task-dependent and also changes over the course of training, with more complex tasks and later training stages tolerating higher levels of staleness​10​. Finding the right rollout size for your specific task can make a large difference.
  • Closely related is the batch size used during optimization, which is also partly a tradeoff between GPU efficiency and ML performance. Especially early in training, learning can be much faster with small batch sizes, but particularly in RL, larger batch sizes often help stabilize training and can allow for better final performance. Since the advantage estimates produced by RL methods tend to carry only a faint training signal buried in a lot of noise, large batches are a key technique for reducing the variance in gradient estimates by averaging many noisy samples. This also means RL often tolerates and benefits from much larger batch sizes than other machine learning tasks. For example, in CodeCraft I eventually ended up combining all 16384 samples from the rollout phase into a single batch, and projects like OpenAI Five​2​ have used batches with millions of timesteps. One final consideration: below some “critical batch size”, the optimal learning rate scales linearly with the batch size (see “Figure 5”​10​). This means that when increasing the batch size, you should try increasing the learning rate as well. Intuitively, instead of taking several small noisy gradient steps, you take one larger step with a more accurate gradient that ends up at the same point in parameter space.
  • If your network is too small, it might limit the performance of your policies, so if policies are failing to learn or hitting a plateau it’s worth trying to increase the size of the neural network. As a point of reference, many challenging RL tasks can be solved with only two or three layers that are a few hundred units wide.
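Two of the tricks above, annealing the entropy coefficient to zero over training and scaling the learning rate linearly with batch size below the critical batch size, can be sketched as simple helpers (the function names and default values are my own, not from the project):

```python
def entropy_coef(step, total_steps, initial=0.1, final=0.0):
    """Linearly anneal the entropy loss coefficient from `initial` (strong
    exploration pressure early in training) down to `final`."""
    frac = min(step / total_steps, 1.0)
    return initial + frac * (final - initial)

def scaled_lr(base_lr, base_batch_size, batch_size):
    """Below the critical batch size, the optimal learning rate scales
    roughly linearly with the batch size."""
    return base_lr * batch_size / base_batch_size
```

For example, a run tuned at lr 3e-4 with batch size 2048 would try lr 6e-4 when doubling the batch to 4096, and the entropy schedule would sit at half its initial value at the midpoint of training.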

Domain Knowledge

If you are interested more in solving a concrete task than in advancing ML research, there are often many ways to mitigate weaknesses in ML methods by augmenting them with tricks that are specific to the task at hand. Some behaviors are difficult to learn with general methods, but very easy to implement directly. Rewarding agents just for the final game outcome is elegant and at least in theory converges to the optimal strategy, but a dense reward that encourages intermediate objectives is often much more effective. We have already mentioned feature engineering. Likewise, there are often many different ways of constraining or expanding the action space that make certain skills easier to discover. Some examples: OpenAI Five​2​ made use of reward shaping and hardcoded both the courier behavior and item builds, AlphaStar​7​ derived a shaped reward function from statistics of human games to encourage agents to follow specific build orders and strategies, the participants of the ViZDoom competition​11​ used a variety of domain-specific tweaks (search for “Notable Solutions”), the second-place winners of the Obstacle Tower Challenge​12​ used a reduced action space and a reshaped reward, and all MineRL​13​ competitors restricted the action space by removing actions rarely taken by human experts.
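As an illustration of reward shaping, a dense reward for a CodeCraft-like game might add small bonuses for intermediate objectives on top of the sparse win/loss signal. All of the coefficients and state fields below are invented for the sketch:

```python
def shaped_reward(prev_state, state, win, loss):
    """A hypothetical dense reward: small bonuses for harvesting resources and
    destroying enemy drones, a penalty for losing allied drones, on top of a
    large sparse reward for the final game outcome."""
    reward = 0.0
    reward += 0.1 * (state["resources"] - prev_state["resources"])
    reward += 1.0 * (prev_state["enemy_drones"] - state["enemy_drones"])
    reward -= 1.0 * (prev_state["allied_drones"] - state["allied_drones"])
    if win:
        reward += 10.0
    if loss:
        reward -= 10.0
    return reward

prev = {"resources": 5, "enemy_drones": 3, "allied_drones": 2}
curr = {"resources": 7, "enemy_drones": 2, "allied_drones": 2}
r = shaped_reward(prev, curr, win=False, loss=False)  # 0.1*2 + 1.0*1 = 1.2
```

Shaping terms like these need care: a badly chosen coefficient can teach the policy to farm the intermediate bonus instead of winning the game.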


Conclusion

While many parts of RL and ML are more of an art than a science, there are still various intuitions and an overarching process that we can follow that tend to produce results. By their very nature, these are harder to convey than concrete algorithms and equations, but they nonetheless constitute an important part of machine learning practice. For many additional recommendations for training machine learning models, check out “A Recipe for Training Neural Networks”​14​, and for a curated collection of many additional resources covering reinforcement learning, see andyljones/reinforcement-learning-discord-wiki. Of course, there is no substitute for actual practice, so if you really want to become an expert, go pick a project that seems interesting to you and just have fun with it!


Thanks to Anssi Kanervisto and Henrique Pondé de Oliveira Pinto for reviewing drafts of this article.


References

  1. Karpathy A. Software 2.0. Medium. Published online 2021.
  2. Berner C, Brockman G, Chan B, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680. Published online 2019.
  3. Kaplan J, McCandlish S, Henighan T, et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. Published online 2020.
  4. Henighan T, Kaplan J, Katz M, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701. Published online 2020.
  5. Narang S, Chung HW, Tay Y, et al. Do Transformer Modifications Transfer Across Implementations and Applications? arXiv preprint arXiv:2102.11972. Published online 2021.
  6. Espeholt L, Soyer H, Munos R, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In: International Conference on Machine Learning. PMLR; 2018:1407–1416.
  7. Vinyals O, Babuschkin I, Czarnecki WM, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature. 2019;575:350–354.
  8. Winter C. Mastering Real-Time Strategy Games with Deep Reinforcement Learning: Mere Mortal Edition. Clemens’ Blog. Published online 2021.
  9. Akkaya I, Andrychowicz M, Chociej M, et al. Solving Rubik’s Cube with a robot hand. arXiv preprint arXiv:1910.07113. Published online 2019.
  10. McCandlish S, Kaplan J, Amodei D, Team OD. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162. Published online 2018.
  11. Wydmuch M, Kempka M, Jaśkowski W. ViZDoom competitions: Playing Doom from pixels. IEEE Transactions on Games. 2018;11:248–259.
  12. Juliani A, Shih J. Announcing the Obstacle Tower Challenge winners and open source release. Unity Technologies Blog. Published online 2019.
  13. Milani S, Topin N, Houghton B, et al. Retrospective analysis of the 2019 MineRL competition on sample efficient reinforcement learning. In: NeurIPS 2019 Competition and Demonstration Track. PMLR; 2020:203–214.
  14. Karpathy A. A Recipe for Training Neural Networks. Andrej Karpathy Blog. Published online 2019.
