Mastering Real-Time Strategy Games with Deep Reinforcement Learning: Mere Mortal Edition

The capabilities of game-playing AIs have grown rapidly over the last few years. This trend has culminated in the defeat of top human players in the complex real-time strategy (RTS) games of DoTA 2​​ [1]​​ and StarCraft II​ [2]​ in 2019. Alas, the exorbitant engineering and compute resources employed by these projects has made their replication difficult. As a result, the application of deep reinforcement learning methods to RTS games has remained disappointingly rare. In an attempt to remedy this sad state of affairs, this article demonstrates how you can use deep reinforcement learning to train your very own sociopaths for a nontrivial RTS game within hours on a single GPU. We achieve this by employing an array of techniques that includes a novel form of automatic domain randomization, curricula, canonicalization of spatial features, an omniscient value function, and a network architecture designed to encode task-specific invariants.

Hacker News (129 points, 66 comments)
r/MachineLearning (34 points, 5 comments)
r/reinforcementlearning (14 points, 1 comment)


By virtue of being commercial games that weren’t originally designed with contemporary reinforcement learning (RL) methods in mind, DoTA 2 and StarCraft II exhibit many avoidable incidental complexities that make training RL agents for them especially costly and challenging. These difficulties can be greatly reduced by choosing games that are more amenable to RL research. In this work, we will be training AIs that are capable of playing CodeCraft. Originally devised to teach my brother how to program (as one does), CodeCraft was designed from the ground up to be used via an API and implements a small but flexible set of game mechanics that still captures many of the core elements of conventional RTS games and allows for the emergence of interesting strategies.

Before embarking on this project I used the following reasoning to conclude that CodeCraft is around 4 orders of magnitude easier than DoTA 2, thus potentially allowing it to be solved by RL with a more modest amount of resources:

  • Games complete within 5 minutes rather than 50 minutes.
  • There are 1-2 orders of magnitude fewer objects/observations and the action space is orders of magnitudes smaller.
  • CodeCraft can be run at 105 frames per second in headless mode on a single core.
  • The API surface is very small, drastically reducing the engineering effort required to interface with the game.
  • It is easy to come up with reward functions that provide fast feedback cycles without introducing strong bias.
  • The complexity of the game can be scaled to any desired level.

Of course, as an engineer, I’m in the business of creating working systems rather than just crafting convincing arguments so this was but the beginning of my work. I figured it would take me around six months to beat the scripted AIs I wrote for CodeCraft many years ago if everything went completely according to plan. (Narrator: Not everything went completely according to plan.) So anyway, two years later here we are!

This article is one in a series of posts about my work on CodeCraft and presents my methods and results. My Reinforcement Learning Learnings gives an overview of the different workflows and intuitions I adopted over the course of this project and will be of particular interest to anyone wishing to learn how to tackle similar problems. Conjuring a CodeCraft Mind recounts a mildly embellished history of this project in the classic style of dark fantasy machine learning poetry (it seemed like a good idea at the time). A future article will explore the broader implications of emerging machine learning techniques for the video game industry.


CodeCraft is a two-player real-time strategy programming game modeled after games like StarCraft. CodeCraft is intended to be “played” by writing a program that controls an army of game units (“drones”), though in this work we will be training a neural network that learns to play the game rather than programming explicit behaviors. Drones can move across the two-dimensional map, harvest resources (“mineral crystals”), use minerals to build new drones, and attack enemy drones. Drones come in various sizes and can be equipped with modules that grant them different abilities. Larger drones can equip more modules and have more hitpoints (2 per white side) but cost more resources to produce and move more slowly than smaller drones.

The side of a drone that faces away from the direction of movement is colored either blue or orange to show the player it belongs to. At the start of the game, each player is given a single drone that is capable of both combat and producing new drones and the objective of the game is to eliminate all drones of the opposing player. There are five different types of modules drones can equip, each providing a unique ability.

The storage module allows drones to harvest mineral crystals and store them as resources. Resources can be transferred to nearby drones.

The constructor module allows a drone to build new drones. Building a drone costs 5 resources per module. The number and type of modules of the new drone can be freely chosen. While a drone is building another drone, it cannot move.

The missile battery module allows a drone to shoot homing missiles at an enemy drone. A drone hit by a missile looses one hitpoint. When a drone reaches zero hitpoints, it is removed from the game.

Homing missiles can only be fired at an enemy within a distance of 300 (red circle).

The shield modules empowers drones with an additional 7 hitpoints. A depleted shield slowly regenerates.

The engine module increases drone movement speed.

After bumping into each other, drones are briefly unable to move.

The initial starting drone, the mothership, boasts 1 shield module, 3 missile modules, 3 constructor modules and 3 storage modules. It is both a powerful combat unit and proficient at harvesting minerals and producing new drones.

A drone can observe objects within a distance of 500 (green circle), other objects remain hidden.

If you want to see all of this in action, watch games of CodeCraft at

Qualitative Results

First, a quick summary of how my AIs work and how they are trained. The central component is a neural network, which in this context is usually referred to as a “policy”. Every 10 game ticks (6 times per second), the policy receives lists of the objects that are currently visible to the player as an input and computes an action for each drone. The set of available actions comprises different movement choices (stay in place, move forward, turn left/right) and a set of drone types that can be built by drones with constructor modules. The policy is initialized randomly at the start of training. Training alternates between playing games of CodeCraft with the current policy against itself, and then using the outcome of those games to create an improved version of the policy that is more likely to choose successful actions.


To begin, let us follow one of my policies through its training process and highlight some of the skills it acquires. The specific policy we will look at is called Grateful Flower (I affectionately refer to my policies by the name autogenerated for them by the Weights & Biases experiment tracking framework). Grateful Flower’s training begins on tiny maps where even completely random movement frequently results in impactful interactions:

Figure 3.1a. Grateful Flower at training step 0. Sped up by 3x.

Within the first 1 million samples, Grateful Flower shifts production towards drones equipped with weapon modules. This is the first in a vast array of skills that will allow it to efficiently exterminate all opposition. No worries though, I have them safely contained within the simulation this is fine 😅.

Figure 3.1b. Fractions of drone types produced by Grateful Flower during the first 1 million training steps. (Dashed blue line: drones with one missile module. Orange line: drones with one storage module.)

At 2.5 million training samples, Grateful Flower still moves fairly randomly but shows a definite tendency for moving towards minerals and enemy drones.

Figure 3.1c. Grateful Flower at training step 2.5M. Sped up by 2x.

At 5 million samples, Grateful Flower has started to develop a basic randomized scouting technique that consists of mostly moving forward and sometimes changing direction:

Figure 3.1d. Grateful Flower at training step 5M. Sped up by 2x.

At 10 million samples, Grateful Flower has grown fond of building drones with 3 missile modules and one shield module. These are very strong against unskilled opponents since they win 1:1 fights against most other drone types and can then retreat and regenerate their shield hitpoints. Grateful Flower has started to develop a number of micromanagement tactics (“micro”). This includes grouping up multiple units, retreating when outnumbered, and waiting for sufficient army advantage before attempting to attack motherships:

Figure 3.1e. Grateful Flower at training step 10M. Sped up by 2x.

At 20 million samples, Grateful Flower has further refined these skills and become much more effective at scouting larger maps:

Figure 3.1f. Grateful Flower at training step 20M. Sped up by 2x.

At 40 million samples, Grateful Flower is capable of fairly precise micro:

Figure 3.1g. Grateful Flower at training step 40M. Sped up by 2x.

At 80 million samples, Grateful Flower is in the middle of phasing out drones with 2 missile and 2 shield modules in favor of drones with 2 missiles 1 shield and 1 engine modules. These make up for their lower raw strength with superior mobility:

Figure 3.1h. Grateful Flower at training step 80M. Sped up by 2x.

Training for another 45 million samples (for a total of 241 days of real-time) gives Grateful Flower a win rate of more than 80% against its former self at 80 million steps.

See Appendix B for longer, non cherry-picked videos of Grateful Flower during various stages of training.

Strategy and Tactics

Policies invariably adopt a strategy of relentless aggression based primarily around fast-moving drones with two missile, one engine, and one shield module (or 2m1e1p for short). At the start of the game, policies generally opt to delay drone construction until they find the first mineral crystal and can harvest enough resources to open with a 2m1e1p drone:

Figure 3.2a. Opening build of Drawn Forest 125M. Sped up by 2x.

Once built, the new drone immediately starts scouting and moves to control the center of the map:

Figure 3.2b. Drawn Forest 125M scouts with its first new drone. Sped up by 2x.

Once the enemy is found, policies engage and try to eke out an advantage through better micro:

Figure 3.2c. First engagement between Drawn Forest 125M and Absurd Waterfall 125M. Sped up by 2x.

Often there is a lengthy back-and-forth as during which neither player manages to obtain a decisive advantage. But when a mistake happens, the game can snowball very quickly:

Figure 3.2d. A reinforcing drone gives Drawn Forest 125M a decisive victory in an engagement against Absurd Waterfall 125M. Sped up by 2x.

Once a policy has gained a large army advantage, it quickly mops up any remaining forces:

Figure 3.2e. Drawn Forest 125M defeats the remaining forces of Absurd Waterfall 125M. Sped up by 2x.

When no enemies are visible, drones fan out and scout in a loose formation. This provides a good compromise between efficiently exploring a large area while still making it possible to quickly regroup into a tighter formation when enemies are encountered:

Figure 3.2f. Three blue drones fan out and scout the map in loose formation after chasing down an enemy drone. Sped up by 2x.

Policies are also capable of other basic forms of coordination, such as waiting for reinforcement outside the weapons range of an enemy mothership and then attacking with a united front:

Figure 3.2g. Drones wait for reinforcements outside the weapons range of the enemy mothership before attacking.

The majority of policies eventually adopt a fast-moving unit composition consisting primarily of 2m1e1p with occasional 1m1p and 1m drones. However, they also experiment with other unit compositions during training and in rare cases settle on a unit composition that consists of stronger but slower-moving drones such as 1m1p, 3m1p, and 2m2p. With the correct micro and good positioning, this unit composition beats out the more common one in straight-up engagements:

Figure 3.2h. The forces of Drawn Forest 125M are pushed against the edge of the map and defeated by Silvery Firefly 125M. Sped up by 2x.

The reason that the more mobile unit composition may be favored is that it makes it possible to pick off units that are out of position and quickly exploit temporary advantages. If a formation of strong but slow-moving units can be broken and scatters, individual units are quickly chased down and the game snowballs:

Figure 3.2i. Drawn Forest 125M seizes upon a temporary unit advantage to take out a slow enemy unit and proceeds to scatter the forces of Silvery Firefly 125M, chase down isolated enemy drones, and eliminate the enemy mothership. Sped up by 2x.

Somewhat disappointingly, policies almost uniformly opt for immediate aggression instead of investing into their economy (some policies do build drones with 1 constructor and 1 storage module on occasion). It is possible that this is due to a limitation of the training process that prevents greedier strategies from being discovered, but it is equally likely CodeCraft is poorly balanced, and more economically focused strategies simply aren’t viable. Seeing as my policies are currently the world’s best CodeCraft players, we’ll just have to take their word for it for the time being.

Out-of-Distribution Generalization

Before this project, the best CodeCraft AIs available to me were two scripted bots, “Replicator” (which builds some harvester/construction drones at the start of the game and then attacks with large swarms of small drones) and “Destroyer” (which builds very cheap and fast drones at the beginning of the game to quickly scout the map and find the enemy mothership which it then attacks with strong, slower-moving drones). Against Replicator (which I thought followed a solid strategy), policies achieve win rates between 90%-98%. Replicator’s main shortfalls seem to be investing into economy and losing units to inaccurate micro (primarily due to being tuned for issuing an action on every tick instead of every 10 ticks, which is how policies are trained and all evaluation games are configured):

Figure 3.3a. Absurd Waterfall 125M eviscerates Replicator. Sped up 2x.

Against Destroyer, my policies typically win 70% of games. Many of the losses are just due to taking longer to find resources at the start of the game, which is at least partially down to chance, but Destroyer also reveals a number of limitations to out-of-distribution generalization.

In the late-game, Destroyer builds a very large type of drone that my AIs never encounter during training. Unsurprisingly, policies don’t know how to correctly micro against this drone, though often they still end up doing something more or less reasonable:

Figure 3.3b. Imperfect micro notwithstanding, Drawn Forest 125M defeats a strong unit produced by Destroyer that it has never encountered in training. Sped up 2x.

My AIs will often move to and stay at the last known position of an enemy drone. When playing against itself during training, this strategy works quite well and is highly likely to quickly encounter enemy forces following the same approach. However, applying this approach to the fast scouting drones used by Destroyer sometimes causes the army to be out of position and unable to defend the mothership:

Figure 3.3c. Absurd Waterfall 125M sends all its forces to the last known position of an enemy scout and looses its mothership against an attack by Destroyer. Sped up 2x.

My policies are hard-coded to always attack the closest enemy. This means that during training it is sufficient to have weakened drones move back only a little further than nearby allies to protect them. This causes some avoidable losses against Destroyer which focus fires weakened drones more effectively:

Figure 3.3d. Destroyer uses target firing to take out a weakened drone of Absurd Waterfall 125M that has retreated behind its allies.

See Appendix C for 9 hours of randomly sampled games between different policies and scripted bots.


Most of the methods described here do not require deep expertise in machine learning or reinforcement learning (RL) to comprehend but some basic intuitions about deep RL are helpful. As such, a high-level overview of deep RL is supplied by the following section.

Deep Reinforcement Learning

RL methods allow us to train agents that attempt to maximize a numerical reward signal received while interacting with an environment (such as CodeCraft) over multiple timesteps. On every timestep $t$, the agent is given a (partial) observation $S_t \in \mathbb{S}$ of the state of the environment (e.g. a list of objects visible to the player) and is allowed to choose an action $A \in \mathbb{A}$ that can influence future states of the environment (e.g. in what direction each drone should move). After each timestep, the agent also receives a numerical reward. The goal of RL is to find the optimal policy $\pi : \mathbb{S} \to \mathbb{A}$ that maps observations to actions that maximize the total expected reward summed over all interactions with the environment. Often, as is the case here, our policy is stochastic, and instead of deterministically computing a single action, it assigns a probability $\pi(A|S_t)$ to every possible action and then chooses an action according to the probabilities.

In the case of deep RL, the policy is parameterized by a deep neural network. At the start of training, the neural network is initialized randomly and takes each action with roughly equal probability. We train the policy by alternating between rollout and optimization phases. During rollouts, we let the policy interact with the environment for a number of timesteps by obtaining an observation, computing the probability of each action under the policy, randomly choosing actions according to this probability distribution, and then passing it to the environment to advance the state by one timestep. All observations, actions, and rewards are recorded for later processing. For performance reasons, it is common to run several environments in parallel during rollouts. During the optimization phase, we adjust the policy’s parameters to make it slightly more likely for it to take the actions that lead to high reward values in the previous rollout phase.

My policies are trained by playing games of CodeCraft against themselves. The specific algorithms employed in this work are called proximal policy optimization (PPO)​ [3]​ and generalized advantage estimation (GAE)​ [4]​. These algorithms determine the gradients that tell us how the policy’s parameters should be adjusted after each rollout. Rollouts are performed in 64 parallel environments for 128 sequential steps. PPO is an instance of actor-critic methods that also learn a value function $V : \mathbb{S} \to \mathbb{R}$ that predicts future reward values. The value function allows us to produce lower-variance estimates of how good an action choice was by asking whether the received reward was lower or higher than the reward predicted by the value function. The value function also makes it possible to learn from partial episodes (as opposed to waiting for games to run fully to completion) by substituting the prediction of the value function for any future rewards at the end of the partial episode, at the cost of introducing bias. I am glossing over a lot of details here. If you want to learn more, Deep Reinforcement Learning: Pong to Pixels and Spinning Up in Deep RL provide good starting points for further study.


The policy implemented here observes a set of global properties, as well as lists of up to 15 allied drones, 15 enemy drones, 5 mineral crystals, and 5 “tile” objects which correspond to grid points on the map that have not been visited for the longest time. The features of objects that are not within the sight radius of any allied drone are set to their last known value. The value function observes all objects regardless of whether they are visible to the player. See Appendix A for a full list of features for each type of object and Appendix E for a detailed example.


The actions that can be performed by each drone are modeled by a single categorical distribution that includes 6 basic movement actions and 11 actions for starting the construction of different drone types. An action can be performed once every 10 game ticks, or 6 times per second when running at 60FPS. Drones automatically harvest resources within range. Drones are prevented from moving while collecting resources. To simplify the implementation, drones also automatically shoot at the closest enemy within range whenever possible rather than choosing which enemy to target as allowed by CodeCraft. This doesn’t affect the optimal overall strategy too much but does have a non-negligible effect on micromanagement because protecting low-health drones requires much more conservative positioning when arbitrary targeting is possible.

Movement Actions

Drone movement is discretized into six different actions: (1) no movement, (2) move forward, (3, 4) turn by ±0.249 radians (chosen as slightly less than 0.25 radians, the maximum rotation possible on a single tick) and then move forward for the remaining 9 ticks, (5, 6) turn by ±2 radians and then move forward. This set of movement actions is a subset of the full movement options available in CodeCraft and somewhat limits the level of precision that can be attained by the policies. Other formulations were not tried and may yield improvements.

Figure 4a. Demonstration of the 6 different movement actions.

Build Actions

The full game of CodeCraft allows for building drones with any combination of up to 10 total modules from the 5 different module types. For simplicity, policies are restricted to 11 different drone types which were chosen to cover an interesting space of strategies and include all the drone types used by the scripted Replicator AI:

  • 1 missile (1m)
  • 1 storage (1s)
  • 2 missile (2m)
  • 1 missile, 1 shield (1s1p)
  • 2 missile, 1 engine, 1 shield (2m1e1p)
  • 2 missile, 2 shield (2m2p)
  • 3 missile, 1 shield (3m1p)
  • 1 storage, 1 constructor (1s1c)
  • 2 storage, 2 constructor (2s2c)
  • 2 storage, 1 constructor, 1 engine (2s1c1e)
  • 2 storage, 1 missile, 1 constructor (2s1m1c)

Action Masking

The probabilities for actions that are invalid at the current timestep are explicitly set to zero. When a drone is stunned or constructing another drone, all actions are masked out except the action for no movement. When a drone does not have any constructor modules, all build actions are masked out. Additionally, when a drone does not have sufficient resources stored to complete a specific build, the corresponding build action is masked out to ensure the drone does not get stuck (drones cannot move while constructing).

Reward Function

Each player $P$ is assigned a score $S_P$ computed as the total resource cost of all the player’s drones $\mathbb{D}_P$ currently in the game, with some adjustment for lost hitpoints:

$S_P = \sum\limits_{d \in \mathbb{D}_P} \text{resource_cost}(d) \times \left( \frac{\text{max_hitpoints}(d)}{\text{hitpoints}(d)} + 1 \right) $

The reward received during training is equal to the change in $\frac{2 S_{self}}{S_{self} + S_{opponent}} – 1$ from the previous timestep. In games where one player is completely eliminated, the sum total reward over the entire episode will equal 1 for the winner and -1 for the loser. Additionally, players receive a one-time reward of +2 when they win the game by eliminating their opponent. Using just the reward of +2 for winning games works as well but results in less capable policies.

Automatic Domain Randomization

When training directly on the unmodified game of CodeCraft, policies often struggle to escape local minima such as suboptimal unit compositions and overly defensive tactics. The learning of robust policies is made possible by curricula and randomizations of the game mechanics that smooth out exploration cliffs and ensure a diverse training distribution. Choosing the right set of randomizations that cover aspects of the environment that policies don’t naturally explore is more important than the precise mathematical formulations presented here, which were chosen primarily for ease of implementation and did not receive much tuning.

Continuous Parameterization of Discrete Game Mechanics

Several aspects of CodeCraft’s game mechanics are represented by discrete integers. One example of this is the number of resources it costs to produce a new drone. To allow for fine-grained parameterization of these game mechanics, we can apply a simple modification to the CodeCraft implementation that makes these game mechanics probabilistic in a way that allows us to smoothly interpolate between discrete outcomes. For example, suppose we want a specific drone type to cost 3.7 resources on average during a specific episode. Then whenever this drone type is constructed, we would choose a cost of 3 with a probability of 30% and a cost of 4 with a probability of 70%. In general, if the cost is parameterized as $c$, we can generate a discrete sample with an expected value of $c$ as $\lfloor c \rfloor + \text{Bernoulli} \left( c – \lfloor c \rfloor \right)$ where $\lfloor \cdot \rfloor$ is the floor function.

Automatic Module Cost Balancing

To encourage exploration and ensure that all drone types are used by policies during training, the cost of different drone types is automatically adjusted depending on how often they are built during rollouts. First, in order to encourage greater variety in module choices, the cost of over/underutilized module types and drone sizes is increased/decreased. Second, to keep drone costs close to the target distribution, cost modifiers are moved towards their default values. A single “variety” hyperparameter controls the tradeoff between the two objectives and is decreased over the course of training to allow for high levels of initial exploration followed by increasing specialization towards the target distribution.

Specifically, the CodeCraft game is parameterized with a set of cost multipliers $c_n \in \mathbb{R}^+$ and $c_t \in \mathbb{R}^+$ for each drone size $n$ and module type $t$. Policies receive these cost multipliers as input features. The cost of a drone with $n$ total modules and $n_m $ modules of type $m$ is sampled at construction time from $\lfloor C \rfloor + \text{Bernoulli} \left( C – \lfloor C \rfloor \right) $, where $C = 5c_n \sum_m n_m c_m $. The standard game of CodeCraft is obtained when $c_x = 1$ for all $x$. The cost modifiers for each training episode are sampled as $c_x \sim \log(U(1, e^{C_x}))$ if $C_x \ge 1$ and $c_x \sim \log(U(e^{C_x}, 1))$ if $C_x < 1$ where $U$ is the uniform distribution and $C_x$ is a set of variables initialized to 1 that are adjusted slightly after each rollout phase to balance out the variety and distance to the target distribution. See for additional details.

One problem with the parameterization presented here is that the utility of a type of drone (as judged by a partially trained neural network) can’t be approximated well by the decomposition into its individual modules. Some other coincidental details of my setup happen to ameliorate this issue. In follow-up work, I have moved to a simpler and more robust scheme where each allowed build has a single cost modifier that is independent of the cost for other drone types and which is sampled from a log-normal distribution.

Map Size Curriculum

To increase the frequency of instructive interactions during the early stages of training before drones learn to effectively scout the map and actively engage enemies, the maximum size of the map is controlled by a parameter $A $ which is slowly increased throughout training according to

$$A = \mathrm{clip} \left( a (t – t_0) + A_{\min} , \min = A_{\min}, \max = A_{\max} \right) $$

where $t $ is the total number of samples that have been processed so far and $A_{min}$, $A_{max}$, $t_0$, $a$ are hyperparameters that determine the initial area, maximum area, first step at which the maximum map size is increased, and the rate of increase. Due to a limitation in CodeCraft’s physics engine, all map sizes have to be an integer multiple of the sight radius of drones (500). When starting a new game, the width $w$ and height $h$ of the map are chosen uniformly from the set:

$$\lbrace (w, h) \in \mathbb N \times \mathbb N \mid w \equiv h \equiv 0 \mod 500, w * h \leq A \leq 2 * w * h \rbrace $$

Mothership Damage Randomization

To allow policies to gradually learn the coordination required to defeat motherships, games are parameterized with a “mothership damage multiplier” $d$. Instead of taking 1 point of damage, the damage received by motherships is randomly sampled from $\lfloor d \rfloor + \text{Bernoulli} \left( d – \lfloor d \rfloor \right) $. At the start of each game, $d$ is sampled as $d \sim 2^{U(0, D)}$ where $D $ is a hyperparameter that starts at 4 and is decreased to 0 over the course of training.

Starting Drone Randomization

In 30% of games, instead of the normal mothership, both players start with two smaller motherships sampled from the set {2s2c, 2s1c1e, 2s1c1m, 1s1c}.

Network Architecture

The network architecture consists of one copy of a policy network for each allied drone currently in existence (up to a maximum of 15) and a single shared value network. Some components of the policy and value networks also share computation and/or parameters.

Policy Network

Figure 4.6a. High level overview of the policy network architecture.

Each policy network receives as input all global features, the features of a single allied drone, and the lists of features for up to 15 allied drones (which include the drone itself), 15 enemy drones, 5 minerals, and 5 tiles. The final output of the policy network is the action probabilities for the single drone it controls. The relative positions of the up to 30 different objects are processed by the same subnetwork before being joined with the corresponding embeddings computed from the objects’ features. The latent values before the final policy head are also utilized by the value network. The weights of the initial feedforward networks operating on the lists of drones/crystals/tiles are shared between the policy and the value network. My implementation makes use of the excellent torch-scatter library to efficiently process variable-length lists. See Appendix E for a detailed example of policy inputs and outputs.

Value Function Network

Figure 4.6b. High level overview of the value function network architecture.

The policy has a single value function network that predicts the return received on each timestep and which are utilized by generalized advantage estimation. The weights and computation for the layers processing lists of allied drones, mineral crystals, and tiles are shared with the policy networks. The weights of the layers processing enemy drones are also shared with the policy network but receive different inputs that include the current state of all enemy drones even when they are not visible to the player.

Input Embeddings

Figure 4.6c. Input embedding.

The input embedding layers first normalize all input features $\mathbf{x} $ by their running mean $\boldsymbol{\mu}$ and variance $\boldsymbol{\sigma}$ and clip them to the range $[-5, 5]$:

$z_i = \text{clip} \left( \frac{x_i – \mu_i}{\sigma_i}, -5, 5 \right) $

This normalization is followed by an affine transformation with a learnable $D \times F $ matrix $\mathbf{\mathit{W}} $ and bias $\mathbf{b}$, a ReLU​ [5]​ nonlinearity, and LayerNorm​ [6]​:

$\mathbf{y} = \text{LayerNorm} \left( \max \left( 0, \mathbf{\mathit{W}} \mathbf{z} + \mathbf{b} \right) \right) $

For the lists of nearby objects and the list of relative positions, $D = 64$, and for the embedding derived from the active drone and global features, $D = 256$.

Feedforward Residual Layers

Figure 4.6d. Feedforward residual block.

The network contains several feedforward residual layers that apply two affine transformations and a ReLU to every input element $\mathbf{x}$

$$\text{FF} \left( \mathbf{x} \right) = \mathbf{\mathit{W}}_2 \max \left( 0, \mathbf{\mathit{W}}_1 \mathbf{z} + \mathbf{b}_1 \right) + \mathbf{b}_2 $$

where $\mathbf{\mathit{W}}_1 \in \mathbb{R}^{2D \times D} $, $\mathbf{\mathit{W}}_2 \in \mathbb{R}^{D \times 2D}$, $\mathbf{b}_1 \in \mathbb{R}^{2D}$ and $\mathbf{b}_2 \in \mathbb{R}^{D}$ are matrices/vectors of learnable parameters. The output of the feedforward layer is added to its original input and followed by LayerNorm:

$\mathbf{y} = \text{LayerNorm} \left( \mathbf{x} + \text{FF} \left( \mathbf{x} \right) \right) $.

Object Concatenation

Figure 5e. Concatenation of object embeddings and relative position embeddings.

The first concatenation layer in the policy network concatenates the embeddings of all objects into a single list and joins them with the embeddings of their corresponding relative positions.

Multi-Head Attention

Figure 4.6e. Scaled dot product multi-head attention block.

The multi-head attention block allows the active drone to attend to all visible objects using multi-head scaled dot product attention​ [7]​, with queries derived from the single 256 width embedding computed from the global features and features of the active drone, and keys and values derived from the list of embeddings of visible objects. The number of attention heads is 8.

Penultimate Dense Layer

Figure 4.6f. Penultimate dense layer.

The penultimate layer of the policy network, the output of which is consumed by the policy and value function heads, applies a learnable affine transformation that is followed by a ReLU but not a residual connection nor LayerNorm.

Policy Head

Figure 4.6g. Policy head.

The final layer of the policy network applies a learnable affine transformation to project the hidden state onto a vector $\mathbf{z} \in \mathbb{R}^{11} $ where, if action $i$ is invalid, we set the corresponding entry $z_i = -\infty $. Action probabilities $\mathbf{p} $ are obtained by applying the softmax function:

$p_i = \frac{ e^{ z_i } } { \sum_j e^{z_j + m_j} } $

Pooling Layers

Figure 4.6h. Value function pooling layers.

The value function collapses the lists of final hidden states for each active drone by taking the max of each dimension, and reduces the embeddings for all objects into a single vector by taking and concatenating the max and the average of each dimension.

Spatial Invariance

The behavior of each drone should be largely invariant to translations of the position of all objects and depend only on the relative positions of objects with respect to the position of the drone. Since all movement actions either rotate drones or move them along the current orientation, the policies should also be largely invariant to any rotations of the world that are centered on the drone. We directly encode these invariances by canonicalizing positions by applying an isometric transformation that places the objects in the coordinate system in which the drone is positioned at (0, 0) and is oriented along the x-axis. There is also a currently unexploited mirror symmetry under which the rotational movement actions are exchanged with the corresponding action rotating in the opposite direction. Presumably, this symmetry could be utilized as a data augmentation, but I don’t know of a way to enforce this symmetry directly without necessitating additional network passes.

Using relative positions does prevent different drones from sharing the trunk of the network. This means that time complexity grows quadratically with the number of drones. This is already true for the dot-product attention operation in vanilla transformers, but here the number of dense matrix multiplications grows quadratically as well. I have made some unsuccessful attempts to operate on absolute positions instead. But at this point, I am relatively convinced that is desirable for the amount of computation performed to scale with the number of drones, and that dedicating a large fraction of computation to calculations on the relative positions between objects is beneficial. The number of objects in CodeCraft does not grow large enough for the quadratic factor to become too problematic. Something along the lines of Barnes-Hut could be an interesting approach for scaling to much larger numbers of objects.

Miscellaneous Details

Policy rollouts are performed for 128 sequential steps in 64 parallel environments, each generating samples for two copies of the policy. Training is run for a total of 125 million samples. All samples from the rollout are combined into a single batch of size 16384 using gradient accumulation. Optimization makes two passes over the data and uses the Adam optimizer with a weight decay of 1e-4 and an initial learning rate of 5e-4, which is decayed to 5e-5 over the course of training with a cosine schedule. PPO clip range is set to 0.2, and GAE uses a discount factor γ of 0.999 and a λ of 0.95. The training loss includes an entropy term which is weighted by 0.2 at the start of training and decayed to 0 by step 90 million. In my usual setup, I run 8 independent training processes on a machine with four RTX 2080 Ti GPUs (sharing each GPU between two runs which results in higher throughput) and a 32-core AMD Ryzen Threadripper 2990WX. In this configuration, training takes just over 2 days, about 12 hours of which are spent on evaluating the performance of the policy rather than training. When training on a dedicated GPU and skipping evaluations, policies reach good performance within less than a day (i.e., hours). Further details can be found by consulting my code which is available on GitHub, bugs and all.


In aggregate, the above methods improve sample efficiency by more than an order of magnitude over a naive implementation. To more clearly characterize individual contributions, I ran ablation-experiments that disable individual techniques and measure the impact on the agent’s performance. Experiments in this section were performed with DeepCodeCraft@f2034f9 and CodeCraftServer@40fe788 unless noted otherwise. Due to compute limitations, I was unable to tune hyperparameters for the different settings which biases all experiments in favor of the baseline. For trivial reasons, my results set a new state-of-the-art. See for instructions on how to reproduce these results. Configs for all ablations are in DeepCodeCraft/ablations. The raw data and many additional metrics for all runs can be found at The logs and final and intermediary model checkpoints for all runs are available here, though unfortunately my job runner occasionally assigns the same output directory to different runs, which caused some directories to get clobbered.

Evaluation Procedure

Every 5 million samples, policies are evaluated for a fixed number of timesteps on a map of size 6000×4000 against two scripted AIs included with CodeCraft (Replicator and Destroyer) and two policies created with an earlier version of my codebase (Curious Galaxy 40M and Graceful Frog 100M). Each evaluation comprises approximately 500 total games, with the final evaluation at 125 million samples running for longer and comprising approximately 3000 games. The metric used to score each game is the reward function described previously. The score is linear in the fraction of drones weighted by cost controlled by the player at the end of the game. If the player won or lost the game by elimination, as is the case in most games, the score equals +1 or -1 respectively.


All subsequent runs are compared to a baseline run with the default hyperparameters.

Figure 5.2a. Mean evaluation score of baseline run against all four opponents.
Figure 5.2b. Evaluation score of baseline runs against each opponent.

Unless noted otherwise, each experiment is performed with 8 independent training runs. The shaded area shows the minimum/maximum scores of all runs, the dotted line is the mean score across all runs, and the error bars denote the standard error of measurement (but note that run scores are not normally distributed and there are likely large systemic sources of error such as from sub-optimal hyperparameters).

Omniscient Value Function

In this experiment, the value function receives the same inputs as the policy and does not receive information about objects not visible to the player. Only results of 5 runs are reported since the other 3 runs crashed with out-of-memory errors that were caused by policies playing very badly and building up unusually high numbers of drones. Since the partial results are quite conclusive I did not rerun the experiment.

Figure 5.3a. Performance comparison of run without omniscient value function.

Rotational Invariance

In this experiment, the positional features of objects are not rotated to align the orientation of the drone with the x-axis (but are still translated to place the drone at the origin).

Figure 5.4a. Performance comparison of run without rotational invariance.
Figure 5.4a. Performance comparison of run without rotational invariance, broken down by opponent.

Automatic Domain Randomization (ADR) and Curriculum

The following table summarizes the final performance of runs that disable different subsets of the automatic module cost balancing, mothership damage modifier, and map size curriculum. For the map size column, “curriculum” corresponds to the setup described in the methods, “randomization” does not gradually increase the maximum map size but still uses the same sampling scheme for the map size, and “fixed” trains directly on the map used for evaluations. All runs with fixed map size were performed with DeepCodeCraft@0494306. The experiment with “randomized” module cost corresponds to a feeble attempt at replacing the automatic module cost balancing with a simpler scheme described in Appendix F. One of the 8 runs for this experiment crashed and is excluded from the results.

Map SizeModule CostMothership DamageScore

When used in combination with any of the other randomizations, the mothership damage multiplier is not only redundant but actually decreases performance. The map size curriculum seems to be somewhat helpful, and the automatic module cost balancing provides a large increase in performance. The effect of the automatic module cost balancing on training diversity is very apparent when comparing the fraction of each drone type produced during training for typical runs with and without this ADR enabled:

Figure 5.5a. Fraction of each drone type produced during training for “Dainty Shape” (module cost ADR enabled) and “Kind Morning” (module cost ADR disabled).
Figure 5.5b. Comparison of baseline against runs without module cost ADR. Performance is comparable to baseline against weaker opponents, but much lower against “Graceful Frog 100M” which has found a unit composition that provides an effective counter to the unit composition adapted by policies trained without module cost ADR.

Shared Spatial Embeddings

In this experiment, there is no shared feedforward network for object positions. Instead, positions are included with the features of each object and processed separately by the feedforward network for each object type. The activation width of the per-object feedforward networks is increased from 64 to 128. The modified policy without shared spatial embeddings is implemented in DeepCodeCraft@7a9d92a.

Figure 5.6a. Performance comparison of run without shared spatial embeddings.
Figure 5.6b. Performance comparison of run without shared spatial embeddings, broken down by opponent.

Sparse Reward Function

In this experiment, the only nonzero reward received by agents is a one time reward of +2 when winning games.

Figure 5.7a. Performance comparison of run with sparse reward function.
Figure 5.7b. Performance comparison of run with sparse reward function, broken down by opponent.

Smaller Policy

In this experiment, the size of all activations of the neural network is reduced by 2x, resulting in a policy with 4x fewer parameters (249,203 total parameters).

Figure 5.8a. Performance comparison of run with smaller neural network.
Figure 5.8b. Performance comparison of run with smaller neural network, broken down by opponent.

This work was originally inspired by the OpenAI Five​ [1]​ project, which trained DoTA 2 agents capable of beating top professional players and also made use of several statically defined domain randomizations to encourage exploration. AlphaStar​ [2]​ achieved a similar feat for the game of StarCraft II while employing a number of additional techniques, including an initial imitation-learning phase, a league of multiple policies that allows for robustness against a variety of strategies, and attention-based network architecture components.​ [8,9]​ use similar techniques and additional innovations to train agents for the Honor of Kings game.​ [10]​ achieves human-level performance in Honor of Kings with only supervised learning.​ [11]​ describes a large-scale distributed-systems architecture for competitive self-play based multi-agent reinforcement learning which is available as open-source.​ [12]​ proposes a set of constrained StarCraft II micro scenarios as benchmarks for cooperative multi-agent RL.

​[13]​ employs automatic domain randomization to achieve sim2real transfer by training on a growing volume of increasingly randomized physical simulations. ​[14]​ randomizes the payoff matrix in multi-agent games to make it possible to find policies that are capable of reciprocity and team formation. The approach taken is very similar to the one in this work in that it broadens the training distribution to be strictly more general than the target task, and policies are conditioned on a noisy version of the sampled ADR parameters and the uncertainty in the payoff matrix. ​[15]​ and ​[16]​ train multiple agents in adversarial setups to automatically generate a curriculum of increasingly difficult but still solvable tasks. ​[17]​ trains ViZDoom agents with an adaptive curriculum that shifts the distribution of levels in response to the performance of the agent. ​[18]​ trains Pommerman agents on a curriculum of increasingly difficult scripted opponents.

The application of transformer-based network architectures to spatial data has been studied in many previous works. ​[19]​ learns simulators for quantum mechanics, physical chemistry, and biochemistry with an architecture that is very similar to the object embedding and attention layer presented here. The main differences are that it stacks multiple layers of full self-attention between all objects, uses additive rather than scaled dot-product attention, and does not allow objects to attend to themselves. The graph neural network in​ [20]​ allows for multi-step local effect propagation and uses hierarchical propagation to efficiently capture interactions between more distant objects. ​[21]​ and ​[22]​ propose architectures for spatio-temporal modeling that attend to all objects and all previous timesteps.


Notably, using an omniscient value function appears to be essential for good performance in partially observable environments like CodeCraft. Curricula and careful network architecture design that encodes specific invariants enable large improvements to sample efficiency. Automatic domain randomizations (ADR) make it possible to sidestep many of the hard exploration problems that still pose a challenge to existing RL methods. The simplicity and effectiveness of ADR makes it an indispensable tool for reinforcement learning projects that target simulators. Particularly the automatic module cost balancing was key to the success of this project. Finding the correct unit composition is an especially difficult and non-stationary exploration problem since the optimal strategy depends not just on static aspects of the environment, but also on the counter-strategies that will be adapted by an intelligent opponent. The automatic module cost balancing solves this by inducing a natural curriculum that forces exploration of all unit types and allows for the gradual acquisition of the skills required to effectively control and counter different unit compositions. While the details of this ADR are specific to CodeCraft, it points to a general recipe that ought to be applicable to other complex adversarial exploration problems that are intractable for hand-designed curricula:

  1. Parameterize challenging aspects of the simulation so it is possible to continuously vary the difficulty/usefulness of the corresponding game mechanics.
  2. Use statistics from the environment to automatically adjust the distributions from which the parameter values are sampled to encourage policies to explore underutilized mechanics.
  3. Condition policies on the parameter values sampled for each simulation instance.

A recent blogpost​ [23]​ by BAIR provides an insightful perspective by framing reinforcement learning as a joint optimization problem over both a policy and a data distribution. Viewed through this lens, ADR, as used here, transforms the target data distribution into a more general data distribution in which local minima of the original data distribution are smoothly connected. This is somewhat analogous to the way continuous relaxation methods are used to efficiently find approximate solutions to discrete optimization problems.

One important insight is that randomizations have to be much more aggressive than might be naively assumed. During the first half of this project, I relied on randomizations that only sample from a small set of possibilities, and a major breakthrough occurred when I added randomizations that choose from very large sets of possibilities or continuously vary parameters of the environment. This is consistent with the findings of [24]​ and ​[25]​, who rigorously studied the effects of environment randomization and demonstrate that fairly high levels of randomness are required to obtain robust policies that generalize to the full distribution rather than just the specific environment instances seen during training. When trained in insufficiently diverse environments, policies will instead overfit to spurious correlations and fail to learn robust features​ [26]​. Of course, too high levels of randomness can also slow learning if it causes policies to spend most of the training in situations that have no applicability to the target task. Care is also required when applying visual data augmentations, which can prove too much for DNN to learn despite seeming light to a human eye.

There is an open question of whether the full application of similar ADRs could have benefited large-scale projects such as OpenAI Five or AlphaStar. One limitation of AlphaStar is that it requires an initial imitation learning stage where it learns from human games. Practical obstacles to parameterizing the StarCraft II game engine aside, I would expect that a map curriculum and automatically balanced randomization of unit and building costs would make it possible to train strong StarCraft II policies purely from self-play. Due to resource constraints, I have been unable to test this hypothesis so far.

A large fraction of RL research uses low-level visual input features as opposed to high-level features derived directly from simulator state, despite the much higher cost of training such policies. One reason is that images provide a fairly universal interface that makes it possible to benchmark a method and network architecture across a wide range of different environments, such as all Atari games. I posit that “lists of objects (in Euclidean space)” provides a similarly uniform abstraction across a possibly even wider range of simulated environments. One obstacle is that proven and reusable network architectures operating on lists/high-level state are not readily available. This is different for visual environments where networks such as IMPALA​ [27]​ have been evaluated across a wide range of tasks. The network architecture presented in this article works well enough on CodeCraft, but has not been applied to other domains. It still has known limitations such as its inability to process variable-length input sequences with full efficiency, quadratic scaling in the number of objects, and lack of stackable self-attention layers that can effectively utilize fine-grained spatial information. Further progress towards a reusable network architecture that can be easily plugged into many different simulators that expose their state as lists of objects could accelerate RL research by making experiments in complex environments much cheaper. If sufficiently capable, such an architecture would find many applications in the domains of commercial video games and scientific simulations.


Thanks to Anssi Kanervisto, Benedikt Winter, York Winter, and Wojciech Zaremba for reviewing drafts of this article. I am indebted to my colleagues at OpenAI for fostering a deeply intellectually stimulating environment in which I first encountered many of the ideas that allowed this work to succeed.


  1. [1]
    C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, and C. Hesse, Dota 2 with Large Scale Deep Reinforcement Learning, arXiv preprint arXiv:1912.06680.
  2. [2]
    O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, and P. Georgiev, Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning, Nature.
  3. [3]
    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal Policy Optimization Algorithms, arXiv preprint arXiv:1707.06347.
  4. [4]
    J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, High-Dimensional Continuous Control Using Generalized Advantage Estimation, arXiv preprint arXiv:1506.02438.
  5. [5]
    A. F. Agarap, Deep Learning Using Rectified Linear Units (Relu), arXiv preprint arXiv:1803.08375.
  6. [6]
    J. L. Ba, J. R. Kiros, and G. E. Hinton, Layer Normalization, arXiv preprint arXiv:1607.06450.
  7. [7]
    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention Is All You Need, arXiv preprint arXiv:1706.03762.
  8. [8]
    D. Ye, Z. Liu, M. Sun, B. Shi, P. Zhao, H. Wu, H. Yu, S. Yang, X. Wu, and Q. Guo, Mastering Complex Control in Moba Games with Deep Reinforcement Learning, in Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34 (2020), pp. 6672–6679.
  9. [9]
    D. Ye, G. Chen, W. Zhang, S. Chen, B. Yuan, B. Liu, J. Chen, Z. Liu, F. Qiu, and H. Yu, Towards Playing Full Moba Games with Deep Reinforcement Learning, arXiv preprint arXiv:2011.12692.
  10. [10]
    D. Ye, G. Chen, P. Zhao, F. Qiu, B. Yuan, W. Zhang, S. Chen, M. Sun, X. Li, and S. Li, Supervised Learning Achieves Human-Level Performance in MOBA Games: A Case Study of Honor of Kings, IEEE Transactions on Neural Networks and Learning Systems.
  11. [11]
    P. Sun, J. Xiong, L. Han, X. Sun, S. Li, J. Xu, M. Fang, and Z. Zhang, TLeague: A Framework for Competitive Self-Play Based Distributed Multi-Agent Reinforcement Learning, arXiv preprint arXiv:2011.12895.
  12. [12]
    M. Samvelyan, T. Rashid, C. S. De Witt, G. Farquhar, N. Nardelli, T. G. Rudner, C.-M. Hung, P. H. Torr, J. Foerster, and S. Whiteson, The Starcraft Multi-Agent Challenge, arXiv preprint arXiv:1902.04043.
  13. [13]
    I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, and R. Ribas, Solving Rubik’s Cube with a Robot Hand, arXiv preprint arXiv:1910.07113.
  14. [14]
    B. Baker, Emergent Reciprocity and Team Formation from Randomized Uncertain Social Preferences, arXiv preprint arXiv:2011.05373.
  15. [15]
    O. OpenAI, M. Plappert, R. Sampedro, T. Xu, I. Akkaya, V. Kosaraju, P. Welinder, R. D’Sa, A. Petron, and H. P. de O. Pinto, Asymmetric Self-Play for Automatic Goal Discovery in Robotic Manipulation, arXiv preprint arXiv:2101.04882.
  16. [16]
    M. Dennis, N. Jaques, E. Vinitsky, A. Bayen, S. Russell, A. Critch, and S. Levine, Emergent Complexity and Zero-Shot Transfer via Unsupervised Environment Design, arXiv preprint arXiv:2012.02096.
  17. [17]
    Y. Wu and Y. Tian, Training Agent for First-Person Shooter Game with Actor-Critic Curriculum Learning.
  18. [18]
    C. Gao, P. Hernandez-Leal, B. Kartal, and M. E. Taylor, Skynet: A Top Deep RL Agent in the Inaugural Pommerman Team Competition, arXiv preprint arXiv:1905.01360.
  19. [19]
    B. Chen, R. Barzilay, and T. Jaakkola, Path-Augmented Graph Transformer Network, arXiv preprint arXiv:1905.12712.
  20. [20]
    Y. Li, J. Wu, R. Tedrake, J. B. Tenenbaum, and A. Torralba, Learning Particle Dynamics for Manipulating Rigid Bodies, Deformable Objects, and Fluids, arXiv preprint arXiv:1810.01566.
  21. [21]
    C. Yu, X. Ma, J. Ren, H. Zhao, and S. Yi, Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction, in European Conference on Computer Vision (Springer, 2020), pp. 507–523.
  22. [22]
    M. A. Alcorn and A. Nguyen, Baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotemporal Modeling, arXiv preprint arXiv:2102.03291.
  23. [23]
    B. Eysenbach, A. Kumar, and A. Gupta, Reinforcement Learning Is Supervised Learning on Optimized Data, The Berkeley Artificial Intelligence Research Blog.
  24. [24]
    K. Cobbe, C. Hesse, J. Hilton, and J. Schulman, Leveraging Procedural Generation to Benchmark Reinforcement Learning, in International Conference on Machine Learning (PMLR, 2020), pp. 2048–2056.
  25. [25]
    K. R. McKee, J. Z. Leibo, C. Beattie, and R. Everett, Quantifying Environment and Population Diversity in Multi-Agent Reinforcement Learning, arXiv preprint arXiv:2102.08370.
  26. [26]
    J. Hilton, N. Cammarata, S. Carter, G. Goh, and C. Olah, Understanding RL Vision, Distill.
  27. [27]
    L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, and I. Dunning, Impala: Scalable Distributed Deep-Rl with Importance Weighted Actor-Learner Architectures, in International Conference on Machine Learning (PMLR, 2018), pp. 1407–1416.


A Input Features

Global Properties

  • current timestep divided by maximum game length
  • total allied score, computed as the sum of resource cost of all allied drones. The resource cost of drones that have less than their full hitpoint values is scaled by $\frac{1}{2} \left( 1 + \frac{\text{hitpoints}}{\text{max_hitpoints}} \right)$.
  • width and height of the map
  • elapsed and remaining timesteps
  • module cost modifiers and mothership damage modifier described in Automatic Module Cost Balancing


The policy observes up to 15 allied drones and 15 enemy drones. The properties for enemy drones that were visible at some point but are now outside of the sight radius of any allies are reported as last known value. The value function observes the current state of all drones regardless of visibility. Any boolean features are set to either 1 or -1 depending on the truth value. The set of features is as follows:

  • x and y position
  • sine and cosine of orientation
  • the number of stored resources
  • whether the drone is building a drone
  • whether the drone is harvesting resources
  • whether the drone is stunned
  • whether the drone is an enemy
  • whether the drone is currently visible
  • number of hitpoints
  • number of storage, missile battery, constructor, engine, and shield modules
  • time elapsed since last visible
  • missile cooldown time


The policy observes up to 5 mineral crystal objects. The properties of mineral crystals obscured by the fog of war are given by the last known values. The set of features is as follows:

  • x and y position
  • size
  • whether it is being harvested by an ally


The “tile” features are a hack to allow non-recurrent policies to scout the map somewhat effectively. The entire map is divided into square tiles with a side length of 400. Each tile keeps track of the last time that an allied drone was located within the tile. Policies observe the 5 least recently visited tiles. Ties are broken randomly according to a second random stable ordering of the tiles. Each tile has the following features:

  • x and y position
  • number of timesteps since last visited, or maximum game length if never visited
  • whether the tile has ever been visited

B Grateful Flower Rollouts

Public Dropbox folder with raw .mkv files.

C Randomly Sampled Games

Public Dropbox folder with raw .mkv files.

D Scatter Connections and Cylindrical Convolution

Nothing described here is actually useful or enabled in my final implementation, but I had already written a section for it and was too emotionally attached to cut it completely. So now it’s an appendix!

Figure D1. Alternative network architecture with scatter connections and cylindrical convolution in parallel to the multi-head attention stack.
Figure D2. Downsample and scatter block.

Most predecessors of the current network architecture included scatter connections that place objects into buckets defined by a radial grid, optionally followed by a cylindrical convolution. In parallel to the multi-head attention, the network constructs an 8×8 grid of values from a partitioning of the map by 8 radial lines and concentric rings of diameter 60 centered at the position of the active drone. Each grid position has 4 channels, the values of which are given by the sum of the downsampled embeddings for all objects positioned within its region. The downsampling is performed by applying an 8 x 128 affine transformation to each object embedding, followed by a ReLU and LayerNorm:

$$\mathbf{z} = \text{LayerNorm}\left( \max\left( 0, \mathbf{\mathit{W}} \mathbf{x} + \mathbf{b} \right) \right) $$

The value associated with each of the 8×8 buckets is then given by the sum of downsampled embeddings of all the objects that are positioned within that bucket. Empty buckets will have a value of zero.

Figure D3. Visualization of the buckets for the spatial map of nearby objects. Along each ray, there are 7 buckets of width 60 and a final bucket for any objects at a distance of 420 or more.

The map of nearby objects recovers some of the performance you lose when using absolute rather than relative position features. At one point I had evidence that the scatter connections increase sample efficiency by a moderate amount even in combination with relative position features, but more recent experiments don’t indicate any benefits. I never saw any benefits from the cylindrical convolution. These components can still be enabled via the nearby_map and map_conv hyperparameters.

Random aside: The AlphaStar paper credits “scatter connections” for a large fraction of performance at the beginning of the paper and then fails to mention the term anywhere else. If you research the question of what “scatter connections” actually are, you will find that the authoritative source on the matter is this Reddit comment. I happen to know the author of the comment and he assures me that he received confirmation of his conjecture from some DeepMind folks at a NeurIPS booth. So I am quite confident in claiming that the operation described in this section is in fact a scatter connection. I spent a fair amount of time researching the correct citation format for this chain of evidence. My sources indicate that the established convention is “solid guy, private communication, late 2019”.

E Detailed Input Output Example

This section provides a detailed example of the different input features and outputs of a trained policy for the blue player in the scene shown below.

Input to the “Active Drone/Globals” network block. Features for all three allied drones concatenated with global properties including automatic domain randomization parameters. A quick look at the number of modules for each drone reveals that the drone at index 0 corresponds to the mothership in the bottom right, index 1 to the 1m1p drone in the top left, and index 2 to the small 1m drone.

 [ xpos, ypos, xangle, yangle, resources, is_constructing,
   is_harvesting, hitpoints, storages, missiles, constructors, engines,
   shields, is_stunned, is_enemy, last_xpos, last_ypos, is_visible,
   walldist_n90, walldist_n45, walldist_0, walldist_45, walldist_90, rel_timestep,
   allied_score,  map_height, map_width, timestep, remaining_timesteps, ms_damage,
   size1_mod, size2_mod, size3_mod, size4_mod, constructor_mod, storage_mod,
   shields_mod, missiles_mod, engines_mod]

[[ 1.11e+03, -5.73e+02, -6.27e-01, -7.79e-01,  2.00e+01, -1.00e+00,
   1.00e+00,  1.90e+01,  3.00e+00,  3.00e+00,  3.00e+00,  0.00e+00,
   1.00e+00, -1.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,  1.00e+00,
   1.44e+02,  1.13e+02,  1.78e+02,  9.32e+02,  1.00e+03,  6.28e-02,
   1.28e+02,  2.00e+03,  3.00e+03,  1.13e+03,  1.69e+04,  1.00e+00,
   1.00e+00,  1.00e+00,  1.00e+00,  1.00e+00,  1.00e+00,  1.00e+00,
   1.00e+00,  1.00e+00,  1.00e+00],
 [-8.27e+02,  5.52e+02,  8.47e-02, -9.96e-01,  0.00e+00, -1.00e+00,
  -1.00e+00,  1.30e+01,  0.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,
   1.00e+00, -1.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,  1.00e+00,
   1.73e+02,  2.68e+02,  1.00e+03,  1.00e+03,  1.00e+03,  6.28e-02,
   1.28e+02,  2.00e+03,  3.00e+03,  1.13e+03,  1.69e+04,  1.00e+00,
   1.00e+00,  1.00e+00,  1.00e+00,  1.00e+00,  1.00e+00,  1.00e+00,
   1.00e+00,  1.00e+00,  1.00e+00],
 [-5.06e+02,  4.55e+02, -3.77e-01, -9.26e-01,  0.00e+00, -1.00e+00,
  -1.00e+00,  2.00e+00,  0.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,
   0.00e+00, -1.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,  1.00e+00,
   5.33e+02,  5.36e+02,  1.00e+03,  1.00e+03,  1.00e+03,  6.28e-02,
   1.28e+02,  2.00e+03,  3.00e+03,  1.13e+03,  1.69e+04,  1.00e+00,
   1.00e+00,  1.00e+00,  1.00e+00,  1.00e+00,  1.00e+00,  1.00e+00,
   1.00e+00,  1.00e+00,  1.00e+00]]

Input to the “Allied Drones” network block. Identical to above, but does not include global features.

[[ 1.11e+03, -5.73e+02, -6.27e-01, -7.79e-01,  2.00e+01, -1.00e+00,
   1.00e+00,  1.90e+01,  3.00e+00,  3.00e+00,  3.00e+00,  0.00e+00,
   1.00e+00, -1.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,  1.00e+00,
   1.44e+02,  1.13e+02,  1.78e+02,  9.32e+02,  1.00e+03],
 [-8.27e+02,  5.52e+02,  8.47e-02, -9.96e-01,  0.00e+00, -1.00e+00,
  -1.00e+00,  1.30e+01,  0.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,
   1.00e+00, -1.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,  1.00e+00,
   1.73e+02,  2.68e+02,  1.00e+03,  1.00e+03,  1.00e+03],
 [-5.06e+02,  4.55e+02, -3.77e-01, -9.26e-01,  0.00e+00, -1.00e+00,
  -1.00e+00,  2.00e+00,  0.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,
   0.00e+00, -1.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,  1.00e+00,
   5.33e+02,  5.36e+02,  1.00e+03,  1.00e+03,  1.00e+03]]

Input to the “Enemy Drones” network block. Features are identical to features for allied drones, though features for drones not currently within an ally’s sight radius would only report the last known value. In this case, all drones are currently visible, and the four entries correspond to the enemy cluster in the top left. The enemy drone in the bottom left has not been spotted yet.

[[-1.26e+03,  6.85e+02, -9.64e-01,  2.67e-01,  2.00e+00, -1.00e+00,
  -1.00e+00,  4.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,  0.00e+00,
   0.00e+00, -1.00e+00, -1.00e+00,  0.00e+00,  3.00e+01,  1.00e+00,
   8.45e+02,  9.36e+02,  1.00e+03,  1.00e+03,  1.00e+03],
 [-1.22e+03,  4.66e+02,  6.36e-01,  7.72e-01,  0.00e+00, -1.00e+00,
  -1.00e+00,  3.00e+00,  0.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,
   0.00e+00, -1.00e+00, -1.00e+00,  0.00e+00,  0.00e+00,  1.00e+00,
   2.90e+02,  2.25e+02,  3.53e+02,  1.00e+03,  1.00e+03],
 [-1.21e+03,  2.75e+02,  1.42e-01, -9.90e-01,  1.00e+00, -1.00e+00,
  -1.00e+00,  1.90e+01,  3.00e+00,  3.00e+00,  3.00e+00,  0.00e+00,
   1.00e+00,  1.00e+00, -1.00e+00,  0.00e+00,  0.00e+00,  1.00e+00,
   1.00e+03,  1.00e+03,  1.00e+03,  2.68e+02,  2.17e+02],
 [-1.07e+03,  3.04e+02, -3.38e-01, -9.41e-01,  0.00e+00, -1.00e+00,
  -1.00e+00,  4.00e+00,  0.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,
   0.00e+00,  1.00e+00, -1.00e+00,  0.00e+00,  0.00e+00,  1.00e+00,
   1.00e+03,  1.00e+03,  1.00e+03,  1.75e+02,  7.93e+01]]

Input to the “All Enemy Drones” network block used by the value function (which uses the same weights as the “Enemy Drones” network). Identical to above, except for the ordering and inclusion of one additional enemy drone that is not visible to the player.

[[-1.26e+03,  6.85e+02, -9.64e-01,  2.67e-01,  2.00e+00, -1.00e+00,
  -1.00e+00,  4.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,  0.00e+00,
   0.00e+00-1.00e+00, -1.00e+00,  1.13e+03,  3.00e+01,  1.00e+00,
   8.45e+02,  9.36e+02,  1.00e+03,  1.00e+03,  1.00e+03],
 [-1.18e+03, -5.05e+02,  6.28e-02, -9.98e-01,  2.00e+00, -1.00e+00,
   1.00e+00,  4.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,  0.00e+00,
   0.00e+00, -1.00e+00, -1.00e+00,  1.13e+03,  3.00e+01,  1.00e+00,
   1.00e+03,  1.00e+03,  9.97e+02,  2.36e+02,  1.78e+02],
 [-1.21e+03,  2.75e+02,  1.42e-01, -9.90e-01,  1.00e+00, -1.00e+00,
  -1.00e+00,  1.90e+01,  3.00e+00,  3.00e+00,  3.00e+00,  0.00e+00,
   1.00e+00,  1.00e+00, -1.00e+00,  1.13e+03,  0.00e+00,  1.00e+00,
   1.00e+03,  1.00e+03,  1.00e+03,  2.68e+02,  2.17e+02],
 [-1.07e+03,  3.04e+02, -3.38e-01, -9.41e-01,  0.00e+00, -1.00e+00,
  -1.00e+00,  4.00e+00,  0.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,
   0.00e+00,  1.00e+00, -1.00e+00,  1.13e+03,  0.00e+00,  1.00e+00,
   1.00e+03,  1.00e+03,  1.00e+03,  1.75e+02,  7.93e+01],
 [-1.22e+03,  4.66e+02,  6.36e-01,  7.72e-01,  0.00e+00, -1.00e+00,
  -1.00e+00,  3.00e+00,  0.00e+00,  1.00e+00,  0.00e+00,  0.00e+00,
   0.00e+00, -1.00e+00, -1.00e+00,  1.13e+03,  0.00e+00,  1.00e+00,
   2.90e+02,  2.25e+02,  3.53e+02,  1.00e+03,  1.00e+03]]

Input to the “Mineral Crystals” network block. The entries correspond to the large minerals in the top left and bottom right, and the group of 3 smaller minerals in the top right.

 [xpos, ypos, size, claimed]

[[ 1.07e+03, -6.09e+02,  1.24e+02,  0.00e+00],
 [ 1.17e+03,  5.69e+02,  3.00e+01,  1.00e+00],
 [ 1.07e+03,  5.86e+02,  5.50e+01,  1.00e+00],
 [ 1.12e+03,  7.11e+02,  7.00e+00,  1.00e+00],
 [-1.07e+03,  6.09e+02,  1.30e+02,  1.00e+00]]

Input to the “Tiles” network block. Location of five tiles at various locations of the map, none of which have been scouted yet.

 [xpos, ypos, time_since_last_visited, visited_before]

[[-1.30e+03,  4.00e+02,  1.80e+04, -1.00e+00],
 [-1.00e+02, -8.00e+02,  1.80e+04, -1.00e+00],
 [ 7.00e+02,  8.00e+02,  1.80e+04, -1.00e+00],
 [-1.00e+02, -4.00e+02,  1.80e+04, -1.00e+00],
 [-9.00e+02,  8.00e+02,  1.80e+04, -1.00e+00]]

Input to the “Relative Positions” network block. Flattened list of relative positions for each drone controlled by the policy to every other object. The 30 relative positions with respect to all minerals and tiles are omitted for the sake of brevity.

 [ xpos, ypos, sqrt_distance ]

 # Allies
 # drone 0
 [ 0.00e+00,  0.00e+00,  0.00e+00],
 [ 1.52e-01, -9.88e-01,  4.73e+01],
 [ 1.12e-01, -9.94e-01,  4.38e+01],
 # drone 1
 [ 5.73e-01,  8.19e-01,  4.73e+01],
 [ 0.00e+00,  0.00e+00,  0.00e+00],
 [ 3.67e-01,  9.30e-01,  1.83e+01],
 # drone 2
 [ 1.78e-01,  9.84e-01,  4.38e+01],
 [ 9.55e-02, -9.95e-01,  1.83e+01],
 [ 0.00e+00,  0.00e+00,  0.00e+00],
 # Enemies
 # drone 0
 [ 1.89e-01, -9.82e-01,  5.18e+01],
 [ 2.57e-01, -9.66e-01,  5.06e+01],
 [ 3.23e-01, -9.46e-01,  4.98e+01],
 [ 2.93e-01, -9.56e-01,  4.85e+01],
 # drone 1
 [-3.77e-01, -9.26e-01,  2.12e+01],
 [ 1.27e-01, -9.92e-01,  2.02e+01],
 [ 5.10e-01, -8.60e-01,  2.18e+01],
 [ 6.45e-01, -7.64e-01,  1.87e+01],
 # drone 2
 [ 8.97e-02, -9.96e-01,  2.80e+01],
 [ 3.64e-01, -9.32e-01,  2.68e+01],
 [ 5.94e-01, -8.04e-01,  2.70e+01],
 [ 6.03e-01, -7.98e-01,  2.43e+01],
 # Minerals [...]
 # Tiles [...]

Output probabilities of the policy network. Drone 0 assigns a probability of 99.3% to the second-to-last action, which corresponds to building a new drone of type 2m1e1p. Drone 1 will either turn left or move left with a probability of 84%, perhaps to retreat from the group of enemies to its right. Similarly, drone 2 assigns these two movement actions a probability of 88.4%.

 [move_left, move_forward, move_right, turn_left, stay, turn_right, build_1m,
  NOT_USED, build1s1c, build2m, build1m1p, build3m1p, build2m2p, build2s1m1c,
  build2s2c, build2s1c1e, build2m1e1p, build1s]]

[[0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 4.74e-03, 0.00e+00, 1.43e-07,
  0.00e+00, 1.94e-07, 1.63e-06, 1.95e-05, 1.81e-03, 3.19e-04, 1.98e-06,
  1.60e-06, 2.34e-06, 9.93e-01, 3.80e-09],
 [3.51e-01, 1.52e-02, 1.12e-01, 4.89e-01, 1.03e-02, 2.28e-02, 0.00e+00,
  0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
  0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00],
 [6.37e-01, 4.11e-02, 5.26e-02, 2.47e-01, 1.57e-02, 6.19e-03, 0.00e+00,
  0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
  0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00]]

The value function outputs a value of 1.04, which can be interpreted as an estimated win probability of just over 50% since the sum total reward for this agent is +3 for winning the game and -1 for losing the game.

F Environmental Impact Assessment

According to nvidia-smi, each of my four GPUs consumes around 100W during typical training runs. The 2990WX processor running at 40% load on all cores might add another 250W. My utility is PG&E and claims an emission rate of 0.524 lbs CO2/kWh. Rounding up to 1kW of total power consumption and very conservatively assuming two years of continuous operation gives us 1kW x 2 years x 0.524 lbs CO2/kWh = 9,180lbs CO2, all of which have been fully offset.

G Statically Randomized Module Cost

To determine whether the automatic module cost randomization is actually necessary, I implemented a simpler domain randomization scheme in DeepCodeCraft@d06bddc. The drone size and module cost modifiers are drawn from a lognormal distribution. The variance of the log of the distribution is set to 1 at the beginning of training and linearly decayed to zero. To prevent very cheap drone costs, all drone cost size modifiers and all module cost modifiers are rescaled so that the smallest modifier is 0.5 if necessary. Additionally, the experiment used CodeCraftServer@ff985aa which disallows building additional drones during rollouts once the number of controllable game units is reached (15). This was necessary to prevent degenerate games with more drones than fit into the game area, which made CodeCraft’s physics/collision detection engine very sad. With statically randomized module costs, performance seems to be even worse than without any module cost randomization, and despite fairly high levels of randomization the unit composition still collapses. I only made a single attempt and there may well be a different or tweaked version of this randomization that would be more successful.

Figure G1. Fraction of each drone type produced during training for “Crisp Moon” (best performing run with static module cost randomization schedule), “Dainty Shape” (module cost ADR enabled), and “Kind Morning” (module cost ADR disabled).
Figure G2. Performance comparison of run with static module cost randomization schedule.
Figure G3. Performance comparison of run with static module cost randomization schedule, broken down by opponent.

2 thoughts on “Mastering Real-Time Strategy Games with Deep Reinforcement Learning: Mere Mortal Edition

  1. Alan

    Might be a silly question, but for the object concatenation, how to you get 30 x 128 at the end, when the inputs are 15 x 64, 15 x 64, 5 x 64, 5 x 64, then 30 x 64 for the relative positions? Doesn’t really add up.

    1. clemenswinter Post author

      You are correct, this should actually be 40 x 128 and 40 x 64. I’ve updated the article with fixed diagrams.


Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.